I have a table with a field A where each entry is a fixed length array A of integers (say length=1000). I want to know how to convert it into 1000 columns, with column name given by index_i, for i=0,1,2,...,999, and each element is the corresponding integer. I can have it done by something like
A[OFFSET(0)] as index_0,
A[OFFSET(1)] as index_1
A[OFFSET(2)] as index_2,
A[OFFSET(3)] as index_3,
A[OFFSET(4)] as index_4,
...
A[OFFSET(999)] as index_999,
I want to know what would be an elegant way of doing this. thanks!
The first thing to say is that, sadly, this is going to be much more complicated than most people expect. It can be conceptually easier to pass the values into a scripting language (e.g. Python) and work there, but clearly keeping things inside BigQuery is going to be much more performant. So here is an approach.
Cross-joining to turn array fields into long-format tables
I think the first thing you're going to want to do is get the values out of the arrays and into rows.
Typically in BigQuery this is accomplished using CROSS JOIN. The syntax is a tad unintuitive:
WITH raw AS (
SELECT "A" AS name, [1,2,3,4,5] AS a
UNION ALL
SELECT "B" AS name, [5,4,3,2,1] AS a
),
long_format AS (
SELECT name, vals
FROM raw
CROSS JOIN UNNEST(raw.a) AS vals
)
SELECT * FROM long_format
UNNEST(raw.a) is taking those arrays of values and turning each array into a set of (five) rows, every single one of which is then joined to the corresponding value of name (the definition of a CROSS JOIN). In this way we can 'unwrap' a table with an array field.
This will yields results like
name | vals
-------------
A | 1
A | 2
A | 3
A | 4
A | 5
B | 5
B | 4
B | 3
B | 2
B | 1
Confusingly, there is a shorthand for this syntax in which CROSS JOIN is replaced with a simple comma:
WITH raw AS (
SELECT "A" AS name, [1,2,3,4,5] AS a
UNION ALL
SELECT "B" AS name, [5,4,3,2,1] AS a
),
long_format AS (
SELECT name, vals
FROM raw, UNNEST(raw.a) AS vals
)
SELECT * FROM long_format
This is more compact but may be confusing if you haven't seen it before.
Typically this is where we stop. We have a long-format table, created without any requirement that the original arrays all had the same length. What you're asking for is harder to produce - you want a wide-format table containing the same information (relying on the fact that each array was the same length.
Pivot tables in BigQuery
The good news is that BigQuery now has a PIVOT function! That makes this kind of operation possible, albeit non-trivial:
WITH raw AS (
SELECT "A" AS name, [1,2,3,4,5] AS a
UNION ALL
SELECT "B" AS name, [5,4,3,2,1] AS a
),
long_format AS (
SELECT name, vals, offset
FROM raw, UNNEST(raw.a) AS vals WITH OFFSET
)
SELECT *
FROM long_format PIVOT(
ANY_VALUE(vals) AS vals
FOR offset IN (0,1,2,3,4)
)
This makes use of WITH OFFSET to generate an extra offset column (so that we know which order the values in the array originally had).
Also, in general pivoting requires us to aggregate the values returned in each cell. But here we expect exactly one value for each combination of name and offset, so we simply use the aggregation function ANY_VALUE, which non-deterministically selects a value from the group you're aggregating over. Since, in this case, each group has exactly one value, that's the value retrieved.
The query yields results like:
name vals_0 vals_1 vals_2 vals_3 vals_4
----------------------------------------------
A 1 2 3 4 5
B 5 4 3 2 1
This is starting to look pretty good, but we have a fundamental issue, in that the column names are still hard-coded. You wanted them generated dynamically.
Unfortunately expressions for the pivot column values aren't something PIVOT can accept out-of-the-box. Note that BigQuery has no way to know that your long-format table will resolve neatly to a fixed number of columns (it relies on offset having the values 0-4 for each and every set of records).
Dynamically building/executing the pivot
And yet, there is a way. We will have to leave behind the comfort of standard SQL and move into the realm of BigQuery Procedural Language.
What we must do is use the expression EXECUTE IMMEDIATE, which allows us to dynamically construct and execute a standard SQL query!
(as an aside, I bet you - OP or future searchers - weren't expecting this rabbit hole...)
This is, of course, inelegant to say the least. But here is the above toy example, implemented using EXECUTE IMMEDIATE. The trick is that the executed query is defined as a string, so we just have to use an expression to inject the full range of values you want into this string.
Recall that || can be used as a string concatenation operator.
EXECUTE IMMEDIATE """
WITH raw AS (
SELECT "A" AS name, [1,2,3,4,5] AS a
UNION ALL
SELECT "B" AS name, [5,4,3,2,1] AS a
),
long_format AS (
SELECT name, vals, offset
FROM raw, UNNEST(raw.a) AS vals WITH OFFSET
)
SELECT *
FROM long_format PIVOT(
ANY_VALUE(vals) AS vals
FOR offset IN ("""
|| (SELECT STRING_AGG(CAST(x AS STRING)) FROM UNNEST(GENERATE_ARRAY(0,4)) AS x)
|| """
)
)
"""
Ouch. I've tried to make that as readable as possible. Near the bottom there is an expression that generates the list of column suffices (pivoted values of offset):
(SELECT STRING_AGG(CAST(x AS STRING)) FROM UNNEST(GENERATE_ARRAY(0,4)) AS x)
This generates the string "0,1,2,3,4" which is then concatenated to give us ...FOR offset IN (0,1,2,3,4)... in our final query (as in the hard-coded example before).
REALLY dynamically executing the pivot
It hasn't escaped my notice that this is still technically insisting on your knowing up-front how long those arrays are! It's a big improvement (in the narrow sense of avoiding painful repetitive code) to use GENERATE_ARRAY(0,4), but it's not quite what was requested.
Unfortunately, I can't provide a working toy example, but I can tell you how to do it. You would simply replace the pivot values expression with
(SELECT STRING_AGG(DISTINCT CAST(offset AS STRING)) FROM long_format)
But doing this in the example above won't work, because long_format is a Common Table Expression that is only defined inside the EXECUTE IMMEDIATE block. The statement in that block won't be executed until after building it, so at build-time long_format has yet to be defined.
Yet all is not lost. This will work just fine:
SELECT *
FROM d.long_format PIVOT(
ANY_VALUE(vals) AS vals
FOR offset IN ("""
|| (SELECT STRING_AGG(DISTINCT CAST(offset AS STRING)) FROM d.long_format)
|| """
)
)
... provided you first define a BigQuery VIEW (for example) called long_format (or, better, some more expressive name) in a dataset d. That way, both the job that builds the query and the job that runs it will have access to the values.
If successful, you should see both jobs execute and succeed. You should then click 'VIEW RESULTS' on the job that ran the query.
As a final aside, this assumes you are working from the BigQuery console. If you're instead working from a scripting language, that gives you plenty of options to either load and manipulate the data, or build the query in your scripting language rather than massaging BigQuery into doing it for you.
Consider below approach
execute immediate ( select '''
select * except(id) from (
select to_json_string(A) id, * except(A)
from your_table, unnest(A) value with offset
)
pivot (any_value(value) index for offset in ('''
|| (select string_agg('' || val order by offset) from unnest(generate_array(0,999)) val with offset) || '))'
)
If to apply to dummy data like below (with 10 instead of 1000 elements)
select [10,11,12,13,14,15,16,17,18,19] as A union all
select [20,21,22,23,24,25,26,27,28,29] as A union all
select [30,31,32,33,34,35,36,37,38,39] as A
the output is
Background
I have a front-end with a list of items with infinite scrolling, and I fetch pages of items by specifying the page limit and offset.
Problem
Apart from simply ordering the result by some of the columns, I would like to add a "random" option. The thing is, I don't want repetitions, so I need to have the entire dataset permutated before doing the limit and offset, and I need to be able to get the same permutation as long as I supply the same seed.
What I tried
A naive approach was to write a table-valued function that takes an int seed and uses it in the ORDER BY clause like so:
SELECT *
FROM dbo.Entities e
ORDER BY HASHBYTES('MD2', e.Title) ^ #seed
OFFSET 0 ROWS
FETCH NEXT (SELECT COUNT(*) FROM dbo.Entities) ROWS ONLY
This seemed to work well at a first glance, but it turned out it's not very "volatile" for the lack of better word - it becomes more visible with sparse result sets, where most seeds (chosen randomly from between 0 and 2147483647) yield the same order.
I thought I would get better results by hashing the seed as well, but SQL Server doesn't allow me to XOR two varbinary variables. Am I even looking in the right direction? Are there any performance considerations that I should be making and I might not be aware of?
The best way is to create a tally table with two columns: first a sequential integer, (between 1 and 1,000,000), second a random integer number. Then generate a random number to get the first value and then make a join with a computed ROW_NUMBER().
CREATE TABLE T_NUM (SEQUENTIAL INT, RANDOM INT);
GO
WITH
N AS(SELECT 0 AS I
UNION ALL
SELECT I + 1
FROM N
WHERE I < 9)
INSERT INTO T_NUM (SEQUENTIAL)
SELECT N1.I + N2.I * 10 + N3.I * 100 + N4.I * 1000 + N5.I * 10000 + N6.I * 100000
FROM N AS N1
CROSS JOIN N AS N2
CROSS JOIN N AS N3
CROSS JOIN N AS N4
CROSS JOIN N AS N5
CROSS JOIN N AS N6;
GO
WITH T AS
(
SELECT SEQUENTIAL, ROW_NUMBER() OVER (ORDER BY CHECKSUM(NEWID())) AS ALEA
FROM T_NUM
)
UPDATE N
SET RANDOM = ALEA
FROM T_NUM AS N
JOIN T ON T.SEQUENTIAL = N.SEQUENTIAL;
GO
DECLARE #SEED INT = FLOOR(1 + RAND() * 1000000);
Now you have your seed to enter in the alea sequence then join your table on sequential order
ORDER BY HASHBYTES('MD2', e.Title + convert(nvarchar(max), #seed))
should work, but performance-wise it would be a disaster. You would calculate MD2 for all records every time. I would not do this on server side at all. You can generate random sequence on client and then just pick from server rows with row number 158, 7, 1027 and 9. But it has still two problems
if item is deleted, row number of all consecutive records shifts. It would just break the whole sequence and you would get duplicities and missing records
row number over millions of records is not that fast either
I see two options. You can query all ids from the table and use them for generating of random order. But that would be a lot of numbers. Or you have to ensure the id space is dense enough. Then you can query 20 random ids and hope at least 10 of them exist. If you are unlucky, you would have to query again.
Our web project has crashed serveral times because of this problem.Most answers online suggests using xmlagg or clob,but still too troublesome.
So how can i rewrite a function like wmconcat or listagg that will only display the first few words in order to avoid the problem,the rest of the words will be replaced by ellipses?
An alternate approach is to skip the concatenation as soon as the length approaches 4000 characters.
Firstly, you need to find the running sum of length of the column you intend to append, following a particular ORDER. Find the maximum number of rows in that order, that you could afford to append before it reaches close to 3500 characters( excluding commas in the final string).
Next, concatenate the string using the same order using LISTAGG limiting to the number of rows found in the first step.
WITH cte(maxrows)
AS (SELECT Max(rn)
FROM (SELECT row_number()
over (
ORDER BY rcol ) rn,
SUM(Length(rcol))
over (
ORDER BY rcol) total_length
FROM yourtable)
WHERE total_length < 3500),
ltd AS (SELECT *
FROM (SELECT rcol,
maxrows,
row_number()
over (
ORDER BY rcol ) rn
FROM yourtable
cross join cte)
WHERE rn <= maxrows)
SELECT LISTAGG(rcol, ',')
within GROUP ( ORDER BY rcol ) less_than_4000
FROM ltd;
DEMO
Note: If you have duplicate entries in the column, it is advisable to take a DISTINCT set before the start of above processing, assuming that you won't need multiple values for a record in the concatenation.
I am trying to generate Fibonacci series using below query (recursive WITH clause).
WITH X(Pnbr,Cnbr) AS
(SELECT 0, 1 FROM dual
UNION ALL
SELECT X.Cnbr, X.Pnbr+X.Cnbr FROM X
WHERE X.Cnbr<50
)
SELECT * FROM X
But I am getting this error
ORA-32044: cycle detected while executing recursive WITH query
Why?
Your data at first iteration would be
PNBR CNBR
0 , 1
1 , 1 + 0
So, CNBR is 1 is first two rows.. A Cycle is detected!
The CONNECTING condition has to be unique!
So probably you would need to maintain an iterator.
ROWNUM is what I used here for it.
WITH X(iter,Pnbr,Cnbr) AS
(SELECT 1,0, 1 FROM dual
UNION ALL
SELECT iter + rownum, X.Cnbr, X.Pnbr+X.Cnbr FROM X
WHERE iter < 50
)
SELECT iter,Pnbr FROM X;
DEMO
I agree with the diagnosis in M. Ravisankar's Answer (from 2015), but not with the remedy.
To handle precisely the situation presented in the original post, recursive CTE offers the CYCLE clause. In this problem, while there will be repeated values in the Pnbr column as well as in the Cnbr column, when considered separately, there are no repeated values (duplicates) in the composite (Pnbr, Cnbr).
So, the query can be written like this:
WITH X(Pnbr,Cnbr) AS
(SELECT 0, 1 FROM dual
UNION ALL
SELECT X.Cnbr, X.Pnbr+X.Cnbr FROM X
WHERE X.Cnbr<50
)
cycle Pnbr, Cnbr set cycle to 'Y' default 'N' ----
SELECT Pnbr, Cnbr FROM X
Notice the cycle clause (second to last line), and also SELECT Pnbr, Cnbr as opposed to SELECT * (if we selected * here, we would also get the cycle column, which we don't need).
modify the column in the where clause. Use X.Pnbr+X.Cnbr instead of X.Cnbr as the condition so that Oracle uses the two referenced columns for the row cycling detection.
WITH X(Pnbr,Cnbr) AS
(SELECT 0, 1 FROM dual
UNION ALL
SELECT X.Cnbr, X.Pnbr+X.Cnbr FROM X
-- Cnbr column is used to detect the cycle data
WHERE X.Pnbr+X.Cnbr < 50
)
SELECT * FROM X;
According to the Oracle Doc:
If you omit the CYCLE clause, then the recursive WITH clause returns an error if cycles are discovered. In this case, a row forms a cycle if one of its ancestor rows has the same values for all the columns in the column alias list for query_name that are referenced in the WHERE clause of the recursive member.
Query Outputs:
Basically I have a database of words,
This database contains a rowID(primary key), the word and word length as table columns.
I want to select a random row where length = x and get the word at that row.
This is for an iPhone game project and it is high priority that the queries are as fast as possible (the searches are made in a game).
For instance:
SELECT * FROM WordsDB WHERE >= (abs(random()) %% (SELECT max(rowid) FROM WordsDB)) LIMIT 1;
This query is really fast at selecting a random row a lot faster than ORDER BY RANDOM() LIMIT 1, however, if I add the word length to the query I get issues:
SELECT * FROM WordsDB WHERE length = 9 AND rowid >= (abs(random()) %% (SELECT max(rowid) FROM WordsDB)) LIMIT 1
Presumably because the random row will not always have a length of 9.
I was just wondering what would be the fastest / most efficient way of doing this.
Thanks for your time
Note: the 2 % symbols are because it is in objective c and the query is set as a string.
This one seems to work ok for me:
select * from WordsDB
where length = 9
limit (abs(random()) % (select count(rowid) from WordsDB
where length = 9)), 1;
note that length = 9 appears in both where clauses.
Add index on length if it appears to be slow.
Add an index to the WordsDB.length
create index if not exists WordsDBLengthIndex on WordsDB (length);
should make selection on this field much faster