Infinite scroll with shuffled results - SQL

How do I return random results that do not repeat?
For example, I have an infinite scrolling page; every time I reach the bottom it returns ten more results, but sometimes the results are repeated.
I'm using this query to get the results:
SELECT TOP 10 * FROM table_name ORDER BY NEWID()
Sorry if this isn't clear.

When you call the query from your application, set the seed for the RAND() function yourself:
DECLARE @rand float;
SET @rand = RAND(your_seed); -- initialize RAND with the seed
SELECT * FROM table_name
ORDER BY RAND() -- calls to RAND should now be based on the seed
OFFSET 0 ROWS FETCH NEXT 10 ROWS ONLY; -- SQL Server's OFFSET/LIMIT equivalent
(not tested)
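Note that SQL Server evaluates RAND() once per query rather than once per row, so the ORDER BY above may not actually shuffle. A minimal sketch of the same seeding idea with a per-row sort key instead, assuming the table has an integer id column and that the application keeps @seed constant for the whole scroll session (@seed and @page are illustrative names):
DECLARE @seed int = 12345; -- fixed for the session, so every page sees the same order
DECLARE @page int = 0;     -- zero-based page index from the client
SELECT *
FROM table_name
ORDER BY CHECKSUM(id, @seed), id -- deterministic shuffle for a fixed seed
OFFSET @page * 10 ROWS FETCH NEXT 10 ROWS ONLY;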

Apparently, NEWID() has known distributional problems. Although random, the numbers sometimes cluster together. This would account for what you are seeing. You could try this:
SELECT TOP 10 *
FROM table_name
ORDER BY rand(checksum(NEWID()));
This may give you better results.
The real answer, though, is to use a seeded pseudo-random number generator. Basically, enumerate the rows of the table and store the value in the table, or calculate it in a deterministic way. Then do simple math to choose a row:
with t as (
select t.*, row_number() over (order by id) as seqnum,
count(*) over () as cnt
from table_name t
)
select t.*
from t
where mod(seqnum * 74873, cnt) = 13907;
The numbers are just two prime numbers, which ensure a lack of cycles.
EDIT:
Here is a more complete solution to your problem:
with t as (
select t.*, row_number() over (order by id) as seqnum,
count(*) over () as cnt
from table_name t
)
select t.*
from t
where mod(seqnum * 74873 + 13907, cnt) <= 10;
Or whatever the limits are. The idea is that using a large prime number as the multiplicative factor makes it highly likely (but not 100% certain) that cnt and 74873 are what is called "mutually prime" or "coprime". This means the pseudo-random number generator just described rearranges the sequence numbers, so you can use simple comparisons to get a certain number of rows. This comes from the part of mathematics called number theory.
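To page an infinite scroll this way, offset the same permutation per page. A minimal sketch in SQL Server syntax, assuming an integer id column and a page size of 10 (@page is an illustrative name):
DECLARE @page int = 0; -- zero-based page index from the client
with t as (
select t.*, row_number() over (order by id) as seqnum,
count(*) over () as cnt
from table_name t
)
select t.*
from t
-- (seqnum * 74873 + 13907) % cnt permutes 0..cnt-1 when 74873 and cnt are coprime
where (seqnum * 74873 + 13907) % cnt >= @page * 10
and (seqnum * 74873 + 13907) % cnt < (@page + 1) * 10;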

Related

Google BigQuery memory error when using ROW_NUMBER() on a large table - ways to replace a long hash by a short unique identifier

For a query in google BigQuery I want to replace a long hash by a shorter numeric unique identifier to save some memory afterwards, so I do:
SELECT
my_hash
, ROW_NUMBER() OVER (ORDER BY null) AS id_numeric
FROM hash_table_raw
GROUP BY my_hash
I don't even need an order in the id, but ROW_NUMBER() requires an ORDER BY.
When I try this on my dataset (> 1 billion rows) I get a memory error:
400 Resources exceeded during query execution: The query could not be executed in the allotted memory. Peak usage: 126% of limit.
Top memory consumer(s):
sort operations used for analytic OVER() clauses: 99%
other/unattributed: 1%
Is there another way to replace a hash by a shorter identifier?
Thanks!
You don't actually need to populate the OVER clause for this.
For example, the following works:
select col, row_number() over() as row_num from (select 'A' as col)
So that should be your first try.
Now, for the billion+ rows that you have: if the above fails, you can do something like this (given that order is not at all important for you), but you will have to do it in parts:
SELECT
my_hash
, ROW_NUMBER() OVER () AS id_numeric
FROM hash_table_raw
where MOD(my_hash, 5) = 0
And in subsequent queries, you can get max(id_numeric) from the previous run and add it as an offset to the next:
SELECT
my_hash
, previous_max_id_numeric_val + ROW_NUMBER() OVER () AS id_numeric
FROM hash_table_raw
where MOD(my_hash, 5) = 1
Then keep appending the outputs of these mod queries (0-4) to a single new table.
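Putting the parts together, here is a hedged sketch using BigQuery scripting. It assumes my_hash is numeric (as the MOD above already requires) and that mydataset.hash_ids is an illustrative target table name:
DECLARE previous_max INT64 DEFAULT 0;
DECLARE part INT64 DEFAULT 0;
CREATE TABLE IF NOT EXISTS mydataset.hash_ids (my_hash INT64, id_numeric INT64);
WHILE part < 5 DO
-- number only the hashes in this mod partition, shifted past all previous parts
INSERT INTO mydataset.hash_ids
SELECT my_hash, previous_max + ROW_NUMBER() OVER () AS id_numeric
FROM hash_table_raw
WHERE MOD(my_hash, 5) = part
GROUP BY my_hash;
SET previous_max = (SELECT MAX(id_numeric) FROM mydataset.hash_ids);
SET part = part + 1;
END WHILE;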

offset count(*) / N columns

I run a job on several machines.
I need to divide the rows into equal parts, one per machine.
The query with all its conditions takes a long time.
Is it possible to do this in one query, without an additional count(*) request?
Something like this:
select * from some_table
where some_rows = some_values
offset (count(*) / count_machines) * (machine_number - 1)
limit count(*) / count_machines
What I sometime do to split data into blocks is a simple modulo on some id:
select * from some_table
where some_rows = some_values
and mod(id, BLOCKS) = BLOCK - 1;
Replace BLOCKS with the number of machines and BLOCK with a number n from 1 to BLOCKS to retrieve block n.
(You can even speed this up with a function index on the modulo expression, provided the number of machines is static and you need this often.)
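For example, a minimal sketch of such an index in Postgres syntax (the index name and a machine count of 4 are assumptions):
-- expression index so "mod(id, 4) = n" can use an index scan
create index some_table_mod_idx on some_table (mod(id, 4));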
If the rows already carry a machine assignment and you want the same number of rows for each machine, then use row_number():
select t.*
from (select t.*,
row_number() over (partition by machine order by machine) as seqnum
from some_table t
where . . .
) t
where seqnum <= 10;
Don't reinvent the wheel.
Use a hash function on value(s) of the record, then take the result modulo the number of machines to get the machine that will hold the record.
If you choose the value(s) well, you get good distribution and good performance, and you know where each record is located.
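A minimal sketch of that idea in Postgres syntax (hashtext() is one possible stable hash; the :machine_count and :machine parameters are illustrative):
-- each machine runs this with its own machine number, 1..machine_count
select *
from some_table
where some_rows = some_values
and mod(abs(hashtext(id::text)), :machine_count) = :machine - 1;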

Select finishes where athlete didn't finish first for the past 3 events

Suppose I have a database of athletic meeting results with a schema as follows
DATE,NAME,FINISH_POS
I wish to do a query to select all rows where an athlete has competed in at least three events without winning. For example with the following sample data
2013-06-22,Johnson,2
2013-06-21,Johnson,1
2013-06-20,Johnson,4
2013-06-19,Johnson,2
2013-06-18,Johnson,3
2013-06-17,Johnson,4
2013-06-16,Johnson,3
2013-06-15,Johnson,1
The following rows:
2013-06-20,Johnson,4
2013-06-19,Johnson,2
Would be matched. I have only managed to get as far as the following stub:
select date,name FROM table WHERE ...;
I've been trying to wrap my head around the WHERE clause, but I can't even get started.
I think this can be even simpler / faster:
SELECT day, place, athlete
FROM (
SELECT *, min(place) OVER (PARTITION BY athlete
ORDER BY day
ROWS 3 PRECEDING) AS best
FROM t
) sub
WHERE best > 1
->SQLfiddle
This uses the aggregate function min() as a window function to get the minimum place of the last three rows plus the current one.
The then trivial check for "no win" (best > 1) has to be done on the next query level, since window functions are applied after the WHERE clause. So you need at least one CTE or subselect for a condition on the result of a window function.
Details about window function calls in the manual here. In particular:
If frame_end is omitted it defaults to CURRENT ROW.
If place (finishing_pos) can be NULL, use this instead:
WHERE best IS DISTINCT FROM 1
min() ignores NULL values, but if all rows in the frame are NULL, the result is NULL.
Don't use type names and reserved words as identifiers; I substituted day for your date.
This assumes at most 1 competition per day, else you have to define how to deal with peers in the time line or use timestamp instead of date.
@Craig already mentioned the index to make this fast.
Here's an alternative formulation that does the work in two scans, without correlated subqueries:
SELECT
"date", athlete, place
FROM (
SELECT
"date",
place,
athlete,
1 <> ALL (array_agg(place) OVER w) AS include_row
FROM Table1
WINDOW w AS (PARTITION BY athlete ORDER BY "date" ASC ROWS BETWEEN 3 PRECEDING AND CURRENT ROW)
) AS history
WHERE include_row;
See: http://sqlfiddle.com/#!1/fa3a4/34
The logic here is pretty much a literal translation of the question. Get the last four placements - current and the previous 3 - and return any rows in which the athlete didn't finish first in any of them.
Because the window frame is the only place where the number of rows of history to consider is defined, you can parameterise this variant unlike my previous effort (obsolete, http://sqlfiddle.com/#!1/fa3a4/31), so it works for the last n for any n. It's also a lot more efficient than the last try.
I'd be really interested in the relative efficiency of this vs @Andomar's query when executed on a dataset of non-trivial size. They're pretty much exactly the same on this tiny dataset. An index on Table1(athlete, "date") would be required for this to perform optimally on a large data set.
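A sketch of that index (the name is illustrative):
CREATE INDEX table1_athlete_date_idx ON Table1 (athlete, "date");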
; with CTE as
(
select row_number() over (partition by athlete order by date) rn
, *
from Table1
)
select *
from CTE cur
where not exists
(
select *
from CTE prev
where prev.place = 1
and prev.athlete = cur.athlete
and prev.rn between cur.rn - 3 and cur.rn
)
Live example at SQL Fiddle.

SQL Server: how to select a fixed amount of rows (select every x-th value)

A short description: I have a table with data that is updated over a certain time period. Now the problem is that, depending on the nature of the sensor which sends the data, in this time period there could be either 50 data sets or 50,000. As I want to visualize this data (using ASP.NET / C#), for a first preview I would like to SELECT just 1000 values from the table.
I already have an approach for doing this: I count the rows in the time period of interest, with a simple WHERE clause to specify the sensor id, save the count as a variable in SQL, and then divide the count(*) by 1000. I tried it in MS Access, where it works just fine:
set @divider = select count(*) from table where [...]
SELECT (Int([RowNumber]/@divider)), First(Value)
FROM myTable
GROUP BY (Int([RowNumber]/@divider));
The trick in Access was that I simply have a data field ("RowNumber"), which is my PK/ID and goes from 0 up. I tried to accomplish the same in SQL Server using the ROW_NUMBER() method, which works more or less. I've got the right syntax for the method, but I cannot use it in the GROUP BY statement:
Windowed functions can only appear in the SELECT or ORDER BY
clauses.
meaning ROW_NUMBER() can't be in the GROUP BY statement.
Now I'm kinda stuck. I've tried to save the ROW_NUMBER value into a char or a separate column and GROUP BY it later on, but I couldn't get it done. And somehow I'm starting to think that my strategy might have its weaknesses ...? :/
To clarify once more: I don't need to SELECT TOP 1000 from my table, because this would just select the first 1000 values (depending on the sorting). I need to SELECT every x-th value, and I can compute the x (I could even round it to an INT, if that would help to get it done). I hope I was able to describe the problem understandably ...
This is my first post here on StackOverflow, I hope I didn't forget anything essential or important, if you need any further information (table structure, my queries so far, ...) please don't hesitate to ask. Any help or hint is highly appreciated - thanks in advance! :)
Update: SOLUTION! Big thanks to https://stackoverflow.com/users/52598/lieven!!!
Here is how I did it in the end:
I declare 2 variables; I count my rows and SET the result into the first one. Then I use ROUND() on the just-assigned variable and divide it by 1000 (because in the end I want ABOUT 1000 values!). I split this operation across 2 variables because using the value from the COUNT function directly as the basis for the ROUND operation produced some errors.
declare @myvar decimal(10,2)
declare @myvar2 decimal(10,2)
set @myvar = (select COUNT(*)
from value_table
where channelid=135 and myDate >= '2011-01-14 22:00:00.000' and myDate <= '2011-02-14 22:00:00.000'
)
set @myvar2 = ROUND(@myvar/1000, 0)
Now I have the rounded value, which I want to be my step size (take every x-th value -> this is our "x" ;)), stored in @myvar2. Next I subselect the data of the desired timespan and channel and add ROW_NUMBER() as column "rn", and finally I add a WHERE clause to the outer SELECT that takes the ROW_NUMBER modulo @myvar2 - when the remainder is 0, the row is SELECTed.
select * from
(
select (ROW_NUMBER() over (order by id desc)) as rn, myValue, myDate
from value_table
where channel_id=135 and myDate >= '2011-01-14 22:00:00.000' and myDate<= '2011-02-14 22:00:00.000'
) d
WHERE rn % @myvar2 = 0
Works like a charm - once again all my thanks to https://stackoverflow.com/users/52598/lieven, see the comment below for the original posting!
In essence, all you need to do to select every x-th value is retain the rows where the rownumber modulo x is 0.
WHERE rn % @x_thValues = 0
Now, to be able to use your ROW_NUMBER's result, you'll need to wrap the entire statement in a subselect:
SELECT *
FROM (
SELECT *
, rn = ROW_NUMBER() OVER (ORDER BY Value)
FROM DummyData
) d
WHERE rn % @x_thValues = 0
Combined with a variable for which x-th values you need, you might use a test script like this:
DECLARE @x_thValues INTEGER = 2
;WITH DummyData AS (SELECT * FROM (VALUES (1), (2), (3), (4)) v (Value))
SELECT *
FROM (
SELECT *
, rn = ROW_NUMBER() OVER (ORDER BY Value)
FROM DummyData
) d
WHERE rn % @x_thValues = 0
One more option to consider:
Select Top 1000 *
From dbo.SomeTable
Where ....
Order By NewID()
but to be honest, I like the previous answer more than this one.
The question could come down to performance.

Query Optimization

This is my current query; it works, but it is slow:
SELECT row, MIN(flg) || ' to ' || MAX(flg) as xyz, avg(amt_won), count(*)
FROM(
SELECT (ROW_NUMBER() OVER (ORDER BY flg))*100/
(SELECT count(*)+100 as temprow FROM temporary_six_max) as row, flg, amt_won
FROM temporary_six_max
JOIN (
SELECT id_player AS pid, avg(flg_vpip::int) AS flg
FROM temporary_six_max
GROUP BY id_player
) AS auxtable
ON pid = id_player
) as auxtable2
group by 1
order by 1;
I am grouping into roughly 100 equal-count ranges, ordered by avg(flg_vpip) grouped by id_player.
I've pasted the results here in case they help to understand:
https://spreadsheets0.google.com/ccc?key=tFVsxkWVn4fMWYBxxGYokwQ&authkey=CNDvuOcG&authkey=CNDvuOcG#gid=0
I wonder if there is a better function to use than ROW_NUMBER() in this case, and I feel like I am doing too many subselects, but I don't know how to optimize it.
I'd appreciate any help very much.
If something is not clear just let me know.
Thank you.
EDIT:
The reason I created auxtable2 is that when I use ROW_NUMBER() OVER (ORDER BY flg) together with other aggregate functions such as avg(amt_won) and count(*), which are essential, I get an error saying that flg should be in an aggregate function; but I can't order by an aggregate function of flg.
I generated some data to test with like this:
create table temporary_six_max as
select id_player, flg_vpip,
random()*100 * (case flg_vpip when 0 then 1 else -1 end) as amt_won
from (select (random()*1000)::int as id_player, random()::int as flg_vpip
from generate_series(1,1000000)) source;
create index on temporary_six_max(id_player);
Your query runs successfully against that, but doesn't quite generate the same plan; I get a nested loop in the lower arm rather than a merge, and a seq scan in the init plan - you haven't turned off enable_seqscan, I hope?
A solution just using a single scan of the table:
select row, min(flg) || ' to ' || max(flg) as xyz, avg(amt_won), count(*)
from (select flg, amt_won, ntile(100) over(order by flg) as row
from (select id_player as pid, amt_won,
avg(flg_vpip::int) over (partition by id_player) as flg
from temporary_six_max
) player_stats
) chunks
group by 1
order by 1
The bad news is that this actually performs worse on my machine, especially if I bump work_mem up enough to avoid the first disk sort (making player_stats, sorting by flg). Although increasing work_mem did halve the query time, so I guess that is at least a start?
Having said that, my queries are running for about 5 seconds to process 10E6 input rows in temporary_six_max, which is an order of magnitude faster than you posted. Does your table fit into your buffer cache? If not, a single-scan solution may be much better for you. (Which version of Postgresql are you using? "explain (analyze on, buffers on) select..." will show you buffer hit/miss rates in 9.0, or just look at your "shared_buffers" setting and compare with the table size)