Set random seed in bigquery

I've seen these two questions mention the same issue. The answers are almost a year old, and I'm wondering whether BigQuery has added anything since; I could not find a concrete answer in the documentation.
I'm trying to do repeated sampling and need consistent results across runs.
The solution provided in this question does not give consistent results.
Here is my code:
SELECT
  *
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER() as incremental_number
  FROM (
    SELECT
      *
    FROM
      Table1 as cmd
    WHERE
      Member NOT IN (
        SELECT
          Member
        FROM
          table2
        WHERE
          Idx = '6'
      )
  ) as t
  WHERE
    MOD(ABS(FARM_FINGERPRINT(TO_JSON_STRING(t))), 10) < 5
)
ORDER BY Member

Maybe just the order of rows is different? Try to sort them.
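
To illustrate why hash-based sampling should select the same rows on every run: the decision depends only on each row's content, not on the order in which rows arrive. Below is a minimal Python sketch of that idea. SHA-256 stands in for FARM_FINGERPRINT (an assumption, so the values will not match BigQuery's), and `in_sample` and the sample rows are hypothetical.

```python
import hashlib
import json

def in_sample(row: dict, buckets: int = 10, keep_below: int = 5) -> bool:
    """Return True if the row falls in the sampled buckets.

    SHA-256 stands in for FARM_FINGERPRINT here (assumption); only the
    principle matters: the decision depends on the row content alone.
    """
    # Serialize with sorted keys so the same row always yields the same string.
    payload = json.dumps(row, sort_keys=True).encode("utf-8")
    digest = hashlib.sha256(payload).digest()
    # Interpret the first 8 bytes as an unsigned integer and bucket it.
    value = int.from_bytes(digest[:8], "big")
    return value % buckets < keep_below

# Hypothetical rows, scanned once in order and once in reverse.
rows = [{"Member": m, "Idx": str(m % 7)} for m in range(1000)]
sample_a = [r["Member"] for r in rows if in_sample(r)]
sample_b = [r["Member"] for r in reversed(rows) if in_sample(r)]
```

Once sorted, both samples contain the same members, which is consistent with the suggestion that only the row order differs. A column such as ROW_NUMBER() OVER() with no ORDER BY is a separate matter: its values are not guaranteed to be stable between runs.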

Related

SQL random sampling into equal groups

I need to randomly sample the users in a table into 4 equal groups using SQL. For that I did the below:
First, randomize all users in the table using the RANDOM() function, then use the result with the NTILE() function to divide them into 4 equal groups, like below:
WITH randomised_users AS (
  SELECT *
  FROM users_table
  ORDER BY RANDOM()
)
SELECT *,
  ntile(4) OVER(ORDER BY (SELECT 1)) AS tile_nr
FROM randomised_users
Is this approach of sampling correct or is there a chance for bias in the 4 groups created from this?

What you have looks fine to me. You don't need a subquery, by the way. This will do just fine:
select *, ntile(4) over (order by random()) as tile_nr
from users_table
Snowflake doesn't guarantee that the query will reproduce the same result set even if you provide a random seed, so make sure to dump any intermediate result set into a temp table if you plan on re-using it.
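
If reproducibility matters, another option is to do the assignment outside the database and store the result. Below is a minimal Python sketch of a seeded shuffle followed by an NTILE-like split. `assign_groups`, the seed value and the ids are hypothetical. The split keeps group sizes within one of each other, but does not reproduce NTILE's exact rule of filling the earlier groups first.

```python
import random
from collections import Counter

def assign_groups(user_ids, n_groups=4, seed=42):
    """Shuffle ids with a fixed seed, then split into near-equal tiles."""
    rng = random.Random(seed)  # local generator; the seed value is arbitrary
    shuffled = list(user_ids)
    rng.shuffle(shuffled)
    n = len(shuffled)
    # Position i goes to tile (i * n_groups // n) + 1, so sizes differ by at most one.
    return {uid: (i * n_groups) // n + 1 for i, uid in enumerate(shuffled)}

groups = assign_groups(range(1, 11))   # ten hypothetical user ids
sizes = Counter(groups.values())       # number of users per tile
```

Running `assign_groups` again with the same ids and seed returns the same mapping, which can then be loaded into a table and joined against.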

Sum pageviews over array

I've been trying to solve a problem for a couple of days, but I'm completely stuck:
I have a basic pageviews table from Snowplow Analytics, and I'm creating a session table from it. This table uses arrays to structure my data.
When I do a sum(count_page_views), the totals are right.
As soon as I add a date dimension, date(session_start), the sum for that day is completely wrong.
This is what the table should look like. (Count distinct on pageview id)
This is what it looks like with my session table SQL:
I'm pretty certain I misunderstand something about how summing arrays and array_length work, but I have no idea what is wrong...
SQL for session table:
with all_page_views as (
  select
    *
  from
    `page_views_table`
),
sessions_agg as (
  select
    pv.session_id,
    array_agg(
      pv
      order by
        pv.page_view_in_session_index
    ) as all_pageviews
  from
    all_page_views as pv
  group by
    1
),
sessions_agg_xf as (
  select
    session_id,
    all_pageviews,
    (
      select
        struct(
          min(page_view_start) as session_start,
          max(page_view_end) as session_end
        )
      from
        unnest(all_pageviews)
    ) as timing
  from
    sessions_agg
),
sessions as (
  select
    timing.session_start,
    timing.session_end,
    array_length(all_pageviews) as count_page_views
  from
    sessions_agg_xf
)
select
  sum(count_page_views)
from
  sessions
where date(session_start) = "2020-02-01"

I believe I've found the problem somewhere else. There was a bug in Snowplow that didn't reset the session id, so my sessionization is wrong...
https://github.com/snowplow/snowplow-javascript-tracker/issues/718

SQL select statement with random number

I'm currently creating a kind of quiz website.
My plan: generate a random number in my SQL statement to get a random question from my table.
E.g. every question has a numeric id -> generate a random number (max number = number of records in my table) -> select a random question by id
I tried the following statement:
SELECT *
FROM question
WHERE ID = (SELECT FLOOR(RAND() * (SELECT COUNT(ID) FROM question)) + 1);
My problem:
Sometimes I get no result, sometimes I get two results, and sometimes it works as planned.
If I run the SELECT FLOOR etc. on its own, the random number works perfectly.
Any suggestions? Thank you in advance, my friends!

If you don't have very many questions, you can simply do:
select q.*
from question q
order by rand()
limit 1;
If you have lots of questions, then reducing the number is important for performance. Something like:
select q.*
from question q cross join
     (select count(*) as cnt from question) qq
where rand() < 100 / qq.cnt -- get a sample of about 100
order by rand()
limit 1;
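
The symptom in the question (zero, one or two rows) is consistent with RAND() being evaluated again for every row in the WHERE clause, so each row is compared against a different random id. Below is a small Python simulation that contrasts this with drawing the number once. The ids 1..10 and the function names are made up for illustration, and the sketch does not claim to reproduce any particular database engine.

```python
import random

ids = list(range(1, 11))  # stand-in for question.ID values 1..10 (assumption)

def per_row_filter(rng):
    """Mimic WHERE ID = FLOOR(RAND() * COUNT) + 1 with a fresh draw per row."""
    return [i for i in ids if i == int(rng.random() * len(ids)) + 1]

def draw_once(rng):
    """Draw the target id a single time, then filter on it."""
    target = int(rng.random() * len(ids)) + 1
    return [i for i in ids if i == target]

rng = random.Random(0)
# Distinct result-set sizes observed over many simulated queries.
per_row_counts = {len(per_row_filter(rng)) for _ in range(2000)}
once_counts = {len(draw_once(rng)) for _ in range(2000)}
```

With a single draw, every simulated query returns exactly one row. With a fresh draw per row, the result size varies and is sometimes empty. The `order by rand() limit 1` form avoids the problem because it never compares ids against a random value.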

Exclude specific column from result in SQL Server [duplicate]

This question already has answers here:
Closed 11 years ago.
Possible Duplicate:
SQL exclude a column using SELECT * [except columnA] FROM tableA?
I have the following query and I want to exclude the column RowNum from the result. How can I do it?
SELECT *
FROM
    (SELECT
         ROW_NUMBER() OVER ( ORDER BY [Report].[dbo].[Reflow].ReflowID ) AS RowNum, *
     FROM
         [Report].[dbo].[Reflow]
     WHERE
         [Report].[dbo].[Reflow].ReflowProcessID = 2) AS RowConstrainedResult
WHERE
    RowNum >= 100 AND RowNum < 120
ORDER BY
    RowNum
Thanks.

It's considered bad practice not to specify column names in your query.
You could push the data into a #temp table, then ALTER the #temp table to DROP the COLUMN, then SELECT * FROM #temp.
This would be inefficient, but it will get you the result you are asking for. By default, though, it's best to get into the habit of specifying all the columns you require. If someone ALTERs your initial table, even with the #temp method above, you'll end up with different columns.

Do not use *; give the list of fields you are interested in. It's that simple. Using "*" is bad practice anyway, as the order is not defined.

Because you want to order the results by RowNum, you cannot drop this column from the inner query. You can save the result of your query in a temp table, then run another query on the temp table that lists the columns you want to show (instead of select *) and orders by RowNum. That approach returns all columns except RowNum, ordered by RowNum's values.

This should work. I don't know the names of your columns, so I used generic names. Try not to use *; it's considered bad practice and makes your code harder for people to read.
SELECT [column1],
       [column2],
       [etcetc]
FROM ( SELECT ROW_NUMBER() OVER(ORDER BY RowConstrainedResult.RowNum) [RN],
              *
       FROM ( SELECT ROW_NUMBER() OVER ( ORDER BY [Report].[dbo].[Reflow].ReflowID ) AS RowNum, *
              FROM [Report].[dbo].[Reflow]
              WHERE [Report].[dbo].[Reflow].ReflowProcessID = 2
            ) AS RowConstrainedResult
       WHERE RowNum >= 100
         AND RowNum < 120
     ) AS NumberedResult
ORDER BY [RN]

SQL Server rand() aggregate

Problem: a table of lat/lng coordinates. Two rows can potentially have the same coordinate. We want a query that returns a set of rows with unique coordinates (within the returned set). Note that distinct is not usable because I need to return the id column, which is, by definition, distinct. This sort of works (@maxcount is the number of rows we need, intid is a unique int id column):
select top (@maxcount) max(intid)
from Documents d
group by d.geoLng, d.geoLat
It will always return the same row for a given coordinate unfortunately, which is bit of a shame for my use. If only we had a rand() aggregate we could use instead of max()... Note that you can't use max() with guids created by newid().
Any ideas?
(there's some more background here, if you're interested: http://www.itu.dk/~friism/blog/?p=121)
UPDATE: Full solution here

You might be able to use a CTE for this with the ROW_NUMBER function across lat and long and then use rand() against that. Something like:
WITH cte AS
(
    SELECT
        intID,
        ROW_NUMBER() OVER
        (
            PARTITION BY geoLat, geoLng
            ORDER BY NEWID()
        ) AS row_num,
        COUNT(intID) OVER (PARTITION BY geoLat, geoLng) AS TotalCount
    FROM
        dbo.Documents
)
SELECT TOP (@maxcount)
    intID, RAND(intID)
FROM
    cte
WHERE
    row_num = 1 + FLOOR(RAND() * TotalCount)
This will always return the first sets of lat and lngs and I haven't been able to make the order random. Maybe someone can continue on with this approach. It will give you a random row within the matching lat and lng combinations though.
If I have more time later I'll try to get around that last obstacle.
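
The intended behaviour (one random row per coordinate, then a random subset of coordinates) can be sketched outside SQL. The Python below is only an illustration of the logic under stated assumptions: `sample_unique_coords` and the sample rows are hypothetical, and it is not a translation of the query above.

```python
import random
from collections import defaultdict

def sample_unique_coords(rows, max_count, rng=None):
    """rows: iterable of (intid, lat, lng). Returns up to max_count ids,
    one per distinct coordinate, chosen at random."""
    rng = rng or random.Random()
    by_coord = defaultdict(list)
    for intid, lat, lng in rows:
        by_coord[(lat, lng)].append(intid)
    # One random id per coordinate (the role played by ROW_NUMBER ... ORDER BY NEWID()).
    picks = [rng.choice(ids) for ids in by_coord.values()]
    # Shuffle so the coordinates that survive the cut are random too.
    rng.shuffle(picks)
    return picks[:max_count]

# Hypothetical data: ids 1-2 share a coordinate, as do ids 4-5.
rows = [(1, 55.6, 12.5), (2, 55.6, 12.5), (3, 55.7, 12.6),
        (4, 40.7, -74.0), (5, 40.7, -74.0)]
result = sample_unique_coords(rows, max_count=2, rng=random.Random(1))
```

The shuffle before the cut is the step the answer above reports as missing: without it, the same coordinates are always returned first.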

Doesn't this work for you?
select top (@maxcount) *
from
(
    select max(intid) as id from Documents d group by d.geoLng, d.geoLat
) t
order by newid()

Where did you get the idea that DISTINCT only works on one column? Anyway, you could also use a GROUP BY clause.
