N random samples from an array in BigQuery

Is it possible to get N random samples from an array?
For example, a table has two columns: id STRING and values ARRAY<STRING>.
The resulting new_values ARRAY<STRING> for each id would be of length N and consist of random values from the original values array (i.e. values picked at N random offsets in the array).

Consider the approach below:
select *, array(
  select value from (
    select value, offset
    from t.values as value with offset
    order by rand()
    limit 5 -- replace 5 with the value of your N
  )
  order by offset
) new_values
from your_table t
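For example, with made-up sample data (the id value and array contents here are just illustrative):

-- a minimal sketch; the WITH clause only fakes a table for testing
with your_table as (
  select 'id1' as id, ['a', 'b', 'c', 'd', 'e', 'f', 'g'] as values
)
select *, array(
  select value from (
    select value, offset
    from t.values as value with offset
    order by rand()
    limit 5
  )
  order by offset  -- keep the sampled values in their original relative order
) new_values
from your_table t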

Related

Filtering a JSON field in Presto to count occurrences of integers

I have a JSON-formatted field in a table.
I want to create a query that counts how many times the numbers 0 to 10 appear under the key employee_nps -> value.
If I try this query to count how many times the number 8 appears:
SELECT
  count(
    CAST(
      filter(
        CAST(json_extract(answers, '$.employee_nps') AS ARRAY(MAP(VARCHAR, JSON))),
        x -> json_format(x['value']) = '8'
      ) AS JSON
    )
  )
FROM
  table
I get the following error:
error querying the database: INVALID_CAST_ARGUMENT: Cannot cast to array(map(varchar, json)). Expected a json array, but got { {"value":["10"]}
Note: sometimes value can be empty, and in rare cases value can contain strings instead of integers, so I would like to check that value is an integer first.
My expected result is:
0 - count 10
1 - count 120
...
10 - count 100
The employee_nps property value (i.e. {"value": ["10"]}) is not an ARRAY(MAP(VARCHAR, JSON)) but just a MAP(VARCHAR, JSON), so you would need to cast to that.
But if you are interested only in the value, you can extract exactly that with the '$.employee_nps.value' path. Then you can cast the value to array(integer) (note that this handles conversion from number strings) and process it:
-- sample data
WITH dataset (answers) AS (
  VALUES (json '{"employee_nps": {"value": ["8"]}}'),
         (json '{"employee_nps": {"value": ["10", "11", "1"]}}')
)
-- query
select cardinality(  -- count the number of elements in the array
         filter(
           cast(json_extract(answers, '$.employee_nps.value') as array(integer)),
           i -> i between 0 and 10  -- drop elements not between 0 and 10
         )
       ) result
from dataset
Output:
 result
--------
      1
      2
If you have json values for value which are not valid number arrays, you can handle these cases with try_cast (or try, depending on the required logic).
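For example, a hardened variant of the query above might look like this (a sketch against the same sample data; try_cast yields NULL for malformed values and coalesce maps that to 0):

select coalesce(
         cardinality(
           filter(
             try_cast(json_extract(answers, '$.employee_nps.value') as array(integer)),
             i -> i between 0 and 10
           )
         ), 0) result  -- NULL from a failed cast becomes a count of 0
from dataset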
UPD
To count numbers you can use unnest to flatten the arrays:
-- query
select num, count(num) num_count
from dataset
cross join unnest(cast(json_extract(answers, '$.employee_nps.value') as array(integer))) t(num)
where num between 0 and 10
group by num
order by num
Output:
 num | num_count
-----+-----------
   1 |         1
   8 |         1
  10 |         1

Insert numbers into a column PSQL

I have a table called Nums in which I have just one column called text. I need to create a PSQL query that will insert a row for each number between 100000 and 199999.
Row number is 100000
Row number is 100001
Row number is 100002
...
Row number is 199999
Obviously, if the range were smaller, say only 10 numbers, it could be done with 10 simple insert statements, but this is not the case. I'm new to PSQL and want to know how this can be achieved. Is some sort of loop needed?
You can use recursive queries.
with recursive cte as
(
  select 100000 as n
  union all
  select n + 1
  from cte
  where n < 199999
)
select * from cte;
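To produce the exact rows from the question, the same CTE can feed the insert directly (a sketch, assuming the table nums with its column text as described):

with recursive cte as
(
  select 100000 as n
  union all
  select n + 1
  from cte
  where n < 199999
)
insert into nums (text)
select 'Row number is ' || n  -- build the full string per row
from cte;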
Assuming you mean "Postgres" for the database, use generate_series():
insert into nums (text)
select n
from generate_series(100000, 199999, 1) gs(n);
Postgres will automatically convert n to a string -- although it seems unusual to store numbers as strings. You can also be explicit using select n::text.
EDIT:
If you need the full string, it would just look like:
insert into nums (text)
select 'Row number is ' || (n::text)
from generate_series(100000, 199999, 1) gs(n);

Hive SQL - how to select the first N elements in a Hive array column and return the selected array

Please consider the Hive table below:
user_id  interest_array
tom      [a,b,c,d,g,w]
bob      [e,d,s,d,g,w,s]
cat      [a]
harry    []
peter    NULL
I want to select the first 3 elements, in sequence, of 'interest_array' per row and return them as an array. The output should look like this:
user_id  output_array
tom      [a,b,c]
bob      [e,d,s]
cat      [a]
harry    []
peter    NULL
PS: the last two rows are not important; they are just corner cases. I can set them to NULL if necessary.
1. Simple method, but it will not work correctly if the initial array can contain fewer elements (the result array will contain NULLs):
with mydata as(
  select array('a','b','c','d','g','w') as original_array
)
select original_array,
       array(original_array[0], original_array[1], original_array[2]) as first_3_array
from mydata
Result:
original_array first_3_array
["a","b","c","d","g","w"] ["a","b","c"]
2. One more method using explode; it works correctly with any arrays.
Explode the array using posexplode, filter position <= 2, collect the array again:
with mydata as(
  select array('a','b','c','d','g','w') as original_array
)
select original_array, collect_list(e.element) as first_3_array
from mydata
lateral view outer posexplode(original_array) e as pos, element
where pos <= 2
group by original_array
Result:
original_array first_3_array
["a","b","c","d","g","w"] ["a","b","c"]
3. A more efficient method, without explode: concatenate the array with a comma delimiter, use a regexp to extract a substring with up to the first 3 elements, then split again:
with mydata as(
  select array('a') as original_array
)
select original_array,
       split(
         regexp_replace(
           regexp_extract(concat_ws(',', original_array), '^(([^,]*,?){1,3})', 1),
           ',$', ''),  -- remove the trailing delimiter
         ',') as first_3_array
from mydata
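With this single-element sample data the query should return:
original_array first_3_array
["a"] ["a"]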

SQL Server 2012: update a row with a unique number

I have a table with 50k records. Now I want to update one column of the table with a random number. The number should be 7 digits.
I don't want to do that with a procedure or a loop.
PinDetailId PinNo
--------------------
783 2722692
784 9888648
785 6215578
786 7917727
I have tried this code but was not able to succeed. I need a 7-digit number.
SELECT
FLOOR(ABS(CHECKSUM(NEWID())) / 2147483647.0 * 3 + 1) rn,
(FLOOR(2000 + RAND() * (3000 - 2000) )) AS rn2
FROM
[GeneratePinDetail]
Random
For a random number, you can use ABS(CHECKSUM(NewId())) % range + lowerbound:
(source: How do I generate random number for each row in a TSQL Select?)
INSERT INTO ResultsTable (PinDetailId, PinNo)
SELECT PinDetailId,
       (ABS(CHECKSUM(NEWID())) % 1000000 + 1000000) AS PinNo
FROM GeneratePinDetail
ORDER BY PinDetailId ASC;
Likely Not Unique
I cannot guarantee these will be unique, but the values should be evenly distributed (an equal chance of any 7-digit number). If you want to check for duplicates, you can run this:
SELECT result.PinDetailId, result.PinNo
FROM ResultsTable result
INNER JOIN (
    SELECT PinNo
    FROM ResultsTable
    GROUP BY PinNo
    HAVING COUNT(1) > 1
) test
    ON result.PinNo = test.PinNo;
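Since the question asks to update an existing column rather than insert into a new table, the same formula can also be applied in place (a sketch; NEWID() is evaluated once per row):

UPDATE GeneratePinDetail
SET PinNo = ABS(CHECKSUM(NEWID())) % 1000000 + 1000000;  -- 7-digit range 1000000..1999999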
You can create a sequence object and update your fields from it - it automatically increments for each row.
https://learn.microsoft.com/en-us/sql/t-sql/functions/next-value-for-transact-sql
Updated based on comment:
After retrieving the 'next value for' in the sequence, you can do operations on it to randomize it. The sequence can then be used to create a unique seed for your randomization function.
If you don't want to create a function yourself, SQL Server already has the RAND function built in.
https://learn.microsoft.com/en-us/sql/t-sql/functions/rand-transact-sql
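A minimal sketch of the sequence approach (the object names are illustrative); sequential values are guaranteed unique but not random, so combine with a randomization step if that matters:

CREATE SEQUENCE PinSeq START WITH 1000000 INCREMENT BY 1;

UPDATE GeneratePinDetail
SET PinNo = NEXT VALUE FOR PinSeq;  -- unique, sequential 7-digit values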

Combine elements of array into different array

I need to split text elements in an array and combine the elements (array_agg) by index into different rows
E.g., input is
'{cat$ball$x... , dog$bat$y...}'::text[]
I need to split each element by '$' and the desired output is:
{cat,dog} - row 1
{ball,bat} - row 2
{x,y} - row 3
...
Sorry for not being clear the first time; I have edited my question. I tried similar options but was unable to figure out how to do it with multiple text elements separated by the '$' symbol.
Exactly two parts per array element (original question)
Use unnest(), split_part() and array_agg():
SELECT array_agg(split_part(t, '$', 1)) AS col1
, array_agg(split_part(t, '$', 2)) AS col2
FROM unnest('{cat$ball, dog$bat}'::text[]) t;
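This should return:

    col1    |    col2
------------+------------
 {cat,dog}  | {ball,bat}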
Related:
Split comma separated column data into additional columns
General solution (updated question)
For any number of arrays with any number of elements containing any number of parts.
Demo for a table tbl:
CREATE TABLE tbl (tbl_id int PRIMARY KEY, arr text[]);
INSERT INTO tbl VALUES
(1, '{cat1$ball1, dog2$bat2}') -- 2 parts per array element, 2 elements
, (2, '{cat$ball$x, dog$bat$y}') -- 3 parts ...
, (3, '{a1$b1$c1$d1, a2$b2$c2$d2, a3$b3$c3$d3}'); -- 4 parts, 3 elements
Query:
SELECT tbl_id, idx, array_agg(elem ORDER BY ord) AS pivoted_array
FROM tbl t
, unnest(t.arr) WITH ORDINALITY a1(string, ord)
, unnest(string_to_array(a1.string, '$')) WITH ORDINALITY a2(elem, idx)
GROUP BY tbl_id, idx
ORDER BY tbl_id, idx;
We are looking at two (nested) LATERAL joins here. LATERAL requires Postgres 9.3. Details:
What is the difference between LATERAL and a subquery in PostgreSQL?
WITH ORDINALITY for the first unnest() is up for debate. A simpler query normally works, too. It's just not guaranteed to work according to SQL standards:
SELECT tbl_id, idx, array_agg(elem) AS pivoted_array
FROM tbl t
, unnest(t.arr) string
, unnest(string_to_array(string, '$')) WITH ORDINALITY a2(elem, idx)
GROUP BY tbl_id, idx
ORDER BY tbl_id, idx;
Details:
PostgreSQL unnest() with element number
WITH ORDINALITY requires Postgres 9.4 or later. The same back-patched to Postgres 9.3:
SELECT tbl_id, idx, array_agg(arr2[idx]) AS pivoted_array
FROM tbl t
, LATERAL (
SELECT string_to_array(string, '$') AS arr2 -- convert string to array
FROM unnest(t.arr) string -- unnest org. array
) x
, generate_subscripts(arr2, 1) AS idx -- unnest 2nd array with ord. numbers
GROUP BY tbl_id, idx
ORDER BY tbl_id, idx;
Each query returns:
 tbl_id | idx | pivoted_array
--------+-----+---------------
      1 |   1 | {cat1,dog2}
      1 |   2 | {ball1,bat2}
      2 |   1 | {cat,dog}
      2 |   2 | {ball,bat}
      2 |   3 | {x,y}
      3 |   1 | {a1,a2,a3}
      3 |   2 | {b1,b2,b3}
      3 |   3 | {c1,c2,c3}
      3 |   4 | {d1,d2,d3}
The only requirement for these queries is that the number of parts in elements of the same array is constant. We could even make it work for a varying number of parts using crosstab() with two parameters to fill in NULL values for missing parts, but that's beyond the scope of this question:
PostgreSQL Crosstab Query
A bit messy but you could unnest the array, use regex to separate the text and then aggregate back up again:
with a as (
  select unnest('{cat$ball, dog$bat}'::_text) some_text
), b as (
  select regexp_matches(a.some_text, '(^[a-z]*)\$([a-z]*$)') animal_object
  from a
)
select array_agg(animal_object[1]) animal, array_agg(animal_object[2]) a_object
from b
If you're processing multiple records at once, you may want to add something like a row number before the unnest so that you have a group-by key for aggregating back into arrays in the final select statement, as sketched below.
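A sketch of that idea; the table name src and array column arr are assumptions, and the row number keeps elements from different input rows apart:

with a as (
  select row_number() over () as rn, unnest(arr) as some_text  -- rn marks the source row
  from src
), b as (
  select rn, regexp_matches(some_text, '(^[a-z]*)\$([a-z]*$)') animal_object
  from a
)
select rn, array_agg(animal_object[1]) animal, array_agg(animal_object[2]) a_object
from b
group by rn;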