Deterministic stats_mode in Oracle

In Oracle, the stats_mode function returns the mode of a set of data. Unfortunately, it is non-deterministic in picking its result in the presence of ties (e.g. stats_mode over the values 1, 2, 1, 2 could return 1 or 2, depending on the ordering of rows inside Oracle). In many situations this is not acceptable. Is there a function or a nice technique for supplying your own deterministic ordering to the stats_mode function?

Oracle's documentation page on STATS_MODE explains: "If more than one mode exists, Oracle Database chooses one and returns only that one value."
As there are no additional parameters, you cannot change its behaviour.
The same page, however, also shows that the following sample query can return multiple mode values:
SELECT x FROM (SELECT x, COUNT(x) AS cnt1 FROM t GROUP BY x)
WHERE cnt1 = (SELECT MAX(cnt2) FROM (SELECT COUNT(x) AS cnt2 FROM t GROUP BY x));
By modifying that query you can again pick a single value, this time determined by an ORDER BY you specify. Note that ROWNUM is assigned before ORDER BY is applied, so the ordering has to happen in a subquery and the ROWNUM filter outside it:
SELECT x FROM (
SELECT x FROM (SELECT x, MAX(y) AS y, COUNT(x) AS cnt1 FROM t GROUP BY x)
WHERE cnt1 = (SELECT MAX(cnt2) FROM (SELECT COUNT(x) AS cnt2 FROM t GROUP BY x))
ORDER BY y DESC
)
WHERE rownum = 1;
A bit messy, unfortunately, though you may be able to tidy it slightly for your particular case. I'm not aware of any fundamentally different approach.

Selecting the most frequently occurring value from a set of values can also be done by counting and ordering.
select x from t group by x order by count(*) desc limit 1;
You can also make it deterministic by additionally ordering on the value itself.
select x from t group by x order by count(*) desc, x desc limit 1;
(LIMIT is not Oracle syntax; on Oracle 12c and later, replace limit 1 with fetch first 1 row only.)
I don't quite understand the complexity of Oracle's query examples; the performance is really bad. Can anyone shed some light on the difference?
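The counting-and-ordering idea with an explicit tie-break can be tried out quickly. The following is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for Oracle (an assumption made purely so the example is runnable; the table t, its sample values and the helper name deterministic_mode are hypothetical):

```python
import sqlite3

# In-memory SQLite database standing in for the real table (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")
# 1 and 2 both occur twice, so the mode is tied.
conn.executemany("INSERT INTO t (x) VALUES (?)", [(1,), (2,), (1,), (2,), (3,)])


def deterministic_mode(connection):
    """Return the most frequent x; ties are broken by the largest x."""
    row = connection.execute(
        "SELECT x FROM t GROUP BY x ORDER BY COUNT(*) DESC, x DESC LIMIT 1"
    ).fetchone()
    return row[0] if row else None


mode_value = deterministic_mode(conn)
print(mode_value)
```

Because the second sort key is the value itself, repeated runs always pick the same member of the tie, regardless of the physical row order.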

Does Big Query support custom sorting?

I am trying to sort data by applying a CASE WHEN expression in the ORDER BY clause, but it looks like BigQuery doesn't support this, even though it works fine in other SQL environments. Can somebody share their thoughts on this?
Update (2021): BigQuery now does support ORDER BY with expressions, e.g.
SELECT event_type, COUNT(*) AS event_count
FROM events
GROUP BY event_type
ORDER BY (
CASE WHEN event_type='generated' THEN 1
WHEN event_type='sent' THEN 2
WHEN event_type='paid' THEN 3
ELSE 4
END
)
select x
from (
select x ,
case when x = 'a' then 'z' else x end as y
from
(select 'a' as x),
(select 'b' as x),
(select 'c' as x),
(select 'd' as x)
)
order by y desc
I think the documentation is pretty clear:
ORDER BY clause
... ORDER BY field1|alias1 [DESC|ASC], field2|alias2 [DESC|ASC] ...
The ORDER BY clause sorts the results of a query in ascending or
descending order of one or more fields. Use DESC (descending) or ASC
(ascending) to specify the sort direction. ASC is the default.
You can sort by field names or by aliases from the SELECT clause. To
sort by multiple fields or aliases, enter them as a comma-separated
list. The results are sorted on the fields in the order in which they
are listed.
So, BigQuery doesn't allow expressions in the ORDER BY. However, you can include the expression in the SELECT and then refer to it by the alias. So, BigQuery does support "custom sorting", but only by expressions in the SELECT.
Interestingly, Hive has a similar limitation.
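The alias workaround described above (compute the sort expression in the SELECT list, then order by its alias) can be sketched as follows. This uses Python's sqlite3 module as a stand-in, so it only illustrates the query shape and says nothing about what BigQuery itself accepts; the events table, its rows and the sort_key alias are hypothetical:

```python
import sqlite3

# SQLite stands in for the real engine here (assumption for runnability).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (event_type TEXT)")
conn.executemany(
    "INSERT INTO events (event_type) VALUES (?)",
    [("paid",), ("sent",), ("generated",), ("refunded",), ("sent",)],
)

# The custom ordering is computed as a column and then referenced by alias.
rows = conn.execute(
    """
    SELECT event_type,
           COUNT(*) AS event_count,
           CASE event_type
               WHEN 'generated' THEN 1
               WHEN 'sent' THEN 2
               WHEN 'paid' THEN 3
               ELSE 4
           END AS sort_key
    FROM events
    GROUP BY event_type
    ORDER BY sort_key
    """
).fetchall()

ordered = [row[0] for row in rows]
print(ordered)
```

The extra sort_key column can simply be ignored by the caller, or dropped by wrapping the query in an outer SELECT.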

Infinite Scroll with shuffle results

How do I return random results that do not repeat?
For example, I've an infinite scrolling page, every time I get to the bottom it returns ten results, but sometimes the results are repeated.
I'm using this query to get results:
SELECT TOP 10 * FROM table_name ORDER BY NEWID()
Sorry, I don't know if you'll understand.
When you call the query from your application you set the seed for the RAND() function.
SET @rand = RAND(your_seed); -- initialize RAND with the seed.
SELECT * FROM table_name
ORDER BY RAND() -- Calls to RAND should now be based on the seed
OFFSET 0 LIMIT 10 -- use some MsSQL equivalent here ;)
(not tested)
Apparently, NEWID() has known distributional problems. Although random, the numbers sometimes cluster together. This would account for what you are seeing. You could try this:
SELECT TOP 10 *
FROM table_name
ORDER BY rand(checksum(NEWID()));
This may give you better results.
The real answer, though, is to use a seeded pseudo-random number generator. Basically, enumerate the rows of the table and store the value in the table. Or calculate it in a deterministic way. Then do simple math to choose a row:
with t as (
select t.*, row_number() over (order by id) as seqnum,
count(*) over () as cnt
from table_name t
)
select t.*
from t
where (seqnum * 74873) % cnt = 13907;
The numbers are just two prime numbers, which ensure a lack of cycles.
EDIT:
Here is a more complete solution to your problem:
with t as (
select t.*, row_number() over (order by id) as seqnum,
count(*) over () as cnt
from table_name t
)
select t.*
from t
where (seqnum * 74873 + 13907) % cnt < 10;
Or whatever the limits are. The idea is that using a large prime number for the multiplicative factor makes it highly likely (but not 100% certain) that cnt and 74873 are what is called "mutually prime" or "coprime". This means that the pseudo-random number generator just described will rearrange the sequence numbers, and you can just use comparisons to get a certain number of rows. This is part of the branch of mathematics called number theory.
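To see why the coprime condition matters for infinite scrolling, here is a small sketch in plain Python (not SQL) of the same arithmetic. It assumes that each page asks for a different range of the shuffled position; the function name shuffled_page and the row count of 37 are hypothetical, while the multiplier and offset are the two numbers from the answer above:

```python
from math import gcd


def shuffled_page(total_rows, page, page_size=10, multiplier=74873, offset=13907):
    """Return the 1-based row numbers that fall on one page of a fixed shuffle."""
    # The mapping seq -> (seq * multiplier + offset) % total_rows is a
    # permutation only when multiplier and total_rows share no common factor.
    if gcd(multiplier, total_rows) != 1:
        raise ValueError("multiplier must be coprime with total_rows")
    low = page * page_size
    high = low + page_size
    return [
        seq
        for seq in range(1, total_rows + 1)
        if low <= (seq * multiplier + offset) % total_rows < high
    ]


total = 37
pages = [shuffled_page(total, p) for p in range(4)]
seen = [seq for page_rows in pages for seq in page_rows]
print(pages[0])
```

Since the mapping is a permutation, consecutive pages never overlap, and scrolling through every page visits each row exactly once. If the row count changes between requests, the permutation changes too, so this only avoids repeats while the table is stable.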

Select finishes where athlete didn't finish first for the past 3 events

Suppose I have a database of athletic meeting results with a schema as follows
DATE,NAME,FINISH_POS
I wish to do a query to select all rows where an athlete has competed in at least three events without winning. For example with the following sample data
2013-06-22,Johnson,2
2013-06-21,Johnson,1
2013-06-20,Johnson,4
2013-06-19,Johnson,2
2013-06-18,Johnson,3
2013-06-17,Johnson,4
2013-06-16,Johnson,3
2013-06-15,Johnson,1
The following rows would be matched:
2013-06-20,Johnson,4
2013-06-19,Johnson,2
I have only managed to get as far as the following stub:
select date,name FROM table WHERE ...;
I've been trying to wrap my head around the WHERE clause but I can't even get started.
I think this can be even simpler / faster:
SELECT day, place, athlete
FROM (
SELECT *, min(place) OVER (PARTITION BY athlete
ORDER BY day
ROWS 3 PRECEDING) AS best
FROM t
) sub
WHERE best > 1
This uses the aggregate function min() as a window function to get the minimum place over the last three rows plus the current one.
The then-trivial check for "no win" (best > 1) has to be done on the next query level, since window functions are applied after the WHERE clause. So you need at least one CTE or sub-select for a condition on the result of a window function.
Details about window function calls are in the manual. In particular:
If frame_end is omitted it defaults to CURRENT ROW.
If place (finishing_pos) can be NULL, use this instead:
WHERE best IS DISTINCT FROM 1
min() ignores NULL values, but if all rows in the frame are NULL, the result is NULL.
Don't use type names and reserved words as identifiers; I substituted day for your date.
This assumes at most one competition per day; otherwise you have to define how to deal with peers in the timeline, or use timestamp instead of date.
@Craig already mentioned the index to make this fast.
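The window-function query can be checked against the sample data from the question. This is a minimal sketch using Python's sqlite3 module rather than PostgreSQL (an assumption for runnability; it needs SQLite 3.25 or later for window functions), with the frame end written out explicitly and the table and column names t, day, athlete and place taken from the query above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (day TEXT, athlete TEXT, place INTEGER)")
# Sample data copied from the question.
conn.executemany(
    "INSERT INTO t (day, athlete, place) VALUES (?, ?, ?)",
    [
        ("2013-06-22", "Johnson", 2),
        ("2013-06-21", "Johnson", 1),
        ("2013-06-20", "Johnson", 4),
        ("2013-06-19", "Johnson", 2),
        ("2013-06-18", "Johnson", 3),
        ("2013-06-17", "Johnson", 4),
        ("2013-06-16", "Johnson", 3),
        ("2013-06-15", "Johnson", 1),
    ],
)

# best = best finish over the current row and the three rows before it.
result = conn.execute(
    """
    SELECT day, athlete, place
    FROM (
        SELECT day, athlete, place,
               MIN(place) OVER (PARTITION BY athlete
                                ORDER BY day
                                ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS best
        FROM t
    ) AS sub
    WHERE best > 1
    ORDER BY day
    """
).fetchall()
print(result)
```

One thing to be aware of: an athlete's first few rows have fewer than three predecessors, so they pass the filter whenever none of the available rows is a win. Add a row-count condition if a full history of four events is required.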
Here's an alternative formulation that does the work in two scans without subqueries:
SELECT
"date", athlete, place
FROM (
SELECT
"date",
place,
athlete,
1 <> ALL (array_agg(place) OVER w) AS include_row
FROM Table1
WINDOW w AS (PARTITION BY athlete ORDER BY "date" ASC ROWS BETWEEN 3 PRECEDING AND CURRENT ROW)
) AS history
WHERE include_row;
See: http://sqlfiddle.com/#!1/fa3a4/34
The logic here is pretty much a literal translation of the question. Get the last four placements - current and the previous 3 - and return any rows in which the athlete didn't finish first in any of them.
Because the window frame is the only place where the number of rows of history to consider is defined, you can parameterise this variant unlike my previous effort (obsolete, http://sqlfiddle.com/#!1/fa3a4/31), so it works for the last n for any n. It's also a lot more efficient than the last try.
I'd be really interested in the relative efficiency of this vs @Andomar's query when executed on a dataset of non-trivial size. They're pretty much exactly the same on this tiny dataset. An index on Table1(athlete, "date") would be required for this to perform optimally on a large data set.
; with CTE as
(
select row_number() over (partition by athlete order by date) rn
, *
from Table1
)
select *
from CTE cur
where not exists
(
select *
from CTE prev
where prev.place = 1
and prev.athlete = cur.athlete
and prev.rn between cur.rn - 3 and cur.rn
)
Live example at SQL Fiddle.

Selecting latest of rows with the same values in several columns

In a project I am working on, measurements are stored in a database. A measurement consists of a world coordinate (posX, posY, posZ), a station identification number (stationID) and a time of measurement (time).
Sometimes a measurement is redone in the field for different reasons, and then there are several measurements with the same coordinate and station ID but performed at different times.
Is there a way to write an SQL query such that I get all VALID measurements, i.e. only the latest ones in the case where the coordinates and station ID are the same?
I am not very adept at SQL, so I don't even really know what to google for. Any pointers are very much appreciated, even if you only know what type of command I should use :)
EDIT:
My task was just changed, apparently station id does not matter, only coordinates and times.
Also, I am using DISQLite3 that implements SQL-92.
Yes, you can do it in SQL.
It seems you want to take the latest entry for each combination of station and co-ordinates - look at GROUP BY or ROW_NUMBER()
Depending on your SQL variant (It's helpful if you specify it), something like...
select *
from
(Select *,
row_number() over (Partition by coordinates, stationid order by measurementtime desc) rn
from yourtable
) v
where rn = 1
Without ranking functions:
select yourtable.*
from yourtable
inner join
(
select coordinate, MAX(time) maxtime from yourtable
group by coordinate
) v
on yourtable.coordinate = v.coordinate
and yourtable.time = v.maxtime
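The join-on-MAX variant needs no ranking functions, so it should suit an SQL-92 style engine. Below is a minimal sketch using Python's sqlite3 module, with the single coordinate column spelled out as the three columns posX, posY and posZ from the question; the table name measurements and the sample rows are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE measurements (
        posX REAL, posY REAL, posZ REAL, stationID INTEGER, "time" TEXT
    )
    """
)
conn.executemany(
    "INSERT INTO measurements VALUES (?, ?, ?, ?, ?)",
    [
        (1.0, 2.0, 3.0, 7, "2013-01-01 10:00"),  # superseded by the next row
        (1.0, 2.0, 3.0, 8, "2013-01-02 09:30"),
        (4.0, 5.0, 6.0, 7, "2013-01-01 11:00"),
    ],
)

# Find the newest time per coordinate, then join back to get the full rows.
latest = conn.execute(
    """
    SELECT m.posX, m.posY, m.posZ, m.stationID, m."time"
    FROM measurements m
    INNER JOIN (
        SELECT posX, posY, posZ, MAX("time") AS maxtime
        FROM measurements
        GROUP BY posX, posY, posZ
    ) v
      ON m.posX = v.posX AND m.posY = v.posY AND m.posZ = v.posZ
     AND m."time" = v.maxtime
    ORDER BY m.posX
    """
).fetchall()
print(latest)
```

If two measurements at the same coordinate share exactly the same time, both are returned; a further tie-break column would be needed to pick one.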

Distribution of table in time

I have a MySQL table with approximately 3000 rows per user. One of the columns is a datetime field, which is mutable, so the rows aren't in chronological order.
I'd like to visualize the time distribution in a chart, so I need a number of individual datapoints. 20 datapoints would be enough.
I could do this:
select timefield from entries where uid = ? order by timefield;
and look at every 150th row.
Or I could do 20 separate queries and use limit 1 and offset.
But there must be a more efficient solution...
Michal Sznajder almost had it, but you can't use column aliases in a WHERE clause in SQL. So you have to wrap it as a derived table. I tried this and it returns 20 rows:
SELECT * FROM (
SELECT @rownum:=@rownum+1 AS rownum, e.*
FROM (SELECT @rownum := 0) r, entries e) AS e2
WHERE uid = ? AND rownum % 150 = 0;
Something like this came to my mind
select @rownum:=@rownum+1 rownum, entries.*
from (select @rownum:=0) r, entries
where uid = ? and rownum % 150 = 0
I don't have MySQL at my hand but maybe this will help ...
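For comparison, the same every-nth-row sampling can be sketched with ROW_NUMBER() in place of MySQL user variables. This is a swapped-in technique, shown with Python's sqlite3 module (SQLite 3.25 or later) on made-up data; the 60 generated rows and the step of 3 are hypothetical stand-ins for the 3000 rows and step of 150 in the question:

```python
import sqlite3
from datetime import datetime, timedelta

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entries (uid INTEGER, timefield TEXT)")

start = datetime(2014, 1, 1)
rows = [(1, (start + timedelta(hours=i)).isoformat(sep=" ")) for i in range(60)]
rows += [(2, (start + timedelta(hours=i)).isoformat(sep=" ")) for i in range(30)]
conn.executemany("INSERT INTO entries (uid, timefield) VALUES (?, ?)", rows)

step = 3  # keep every 3rd row of the chosen user

# Filter by user and number the rows in time order first, then sample.
sample = conn.execute(
    """
    SELECT timefield
    FROM (
        SELECT timefield,
               ROW_NUMBER() OVER (ORDER BY timefield) AS rn
        FROM entries
        WHERE uid = ?
    ) AS numbered
    WHERE rn % ? = 0
    ORDER BY rn
    """,
    (1, step),
).fetchall()
print(len(sample))
```

Filtering on uid inside the derived table means the numbering covers only that user's rows, in chronological order, so the sampled points are evenly spaced by rank.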
As far as visualization, I know this is not the periodic sampling you are talking about, but I would look at all the rows for a user and choose an interval bucket, SUM within the buckets and show on a bar graph or similar. This would show a real "distribution", since many occurrences within a time frame may be significant.
SELECT DATEADD(day, DATEDIFF(day, 0, timefield), 0) AS bucket -- choose an appropriate granularity (days used here)
,COUNT(*)
FROM entries
WHERE uid = ?
GROUP BY DATEADD(day, DATEDIFF(day, 0, timefield), 0)
ORDER BY DATEADD(day, DATEDIFF(day, 0, timefield), 0)
Or if you don't like the way you have to repeat yourself - or if you are playing with different buckets and want to analyze across many users in 3-D (measure in Z against x, y uid, bucket):
SELECT uid
,bucket
,COUNT(*) AS measure
FROM (
SELECT uid
,DATEADD(day, DATEDIFF(day, 0, timefield), 0) AS bucket
FROM entries
) AS buckets
GROUP BY uid
,bucket
ORDER BY uid
,bucket
If I wanted to plot in 3-D, I would probably determine a way to order users according to some meaningful overall metric for the user.
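The bucketing idea can be sketched in a runnable form as well. The example below uses Python's sqlite3 module, where date(timefield) plays the role of the DATEADD/DATEDIFF day truncation (an assumption about the engine; the sample rows are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entries (uid INTEGER, timefield TEXT)")
conn.executemany(
    "INSERT INTO entries (uid, timefield) VALUES (?, ?)",
    [
        (1, "2014-03-01 08:00:00"),
        (1, "2014-03-01 17:30:00"),
        (1, "2014-03-02 09:15:00"),
        (2, "2014-03-01 12:00:00"),
    ],
)

# One row per day for the chosen user, with the number of entries that day.
buckets = conn.execute(
    """
    SELECT date(timefield) AS bucket, COUNT(*) AS measure
    FROM entries
    WHERE uid = ?
    GROUP BY date(timefield)
    ORDER BY date(timefield)
    """,
    (1,),
).fetchall()
print(buckets)
```

Days without any entries produce no row, so a chart built from this output needs to fill the gaps with zero if empty days should be visible.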
@Michal
For whatever reason, your example only works when the where @recnum uses a less than operator. I think when the where filters out a row, the rownum doesn't get incremented, and it can't match anything else.
If the original table has an auto incremented id column, and rows were inserted in chronological order, then this should work:
select timefield from entries
where uid = ? and id % 150 = 0 order by timefield;
Of course that doesn't work if there is no correlation between the id and the timefield, unless you don't actually care about getting evenly spaced timefields, just 20 random ones.
Do you really care about the individual data points? Or will using the statistical aggregate functions on the day number instead suffice to tell you what you wish to know?
AVG
STDDEV_POP
VARIANCE
TO_DAYS
select timefield
from entries
where rand() < .01 -- returns roughly 1% of rows; adjust as needed.
I'm not a MySQL expert, so I'm not sure how rand() operates in this environment.
For my reference - and for those using postgres - Postgres 9.4 will have ordered set aggregates that should solve this problem:
SELECT percentile_disc(0.95)
WITHIN GROUP (ORDER BY response_time)
FROM pageviews;
Source: http://www.craigkerstiens.com/2014/02/02/Examining-PostgreSQL-9.4/