How to sort an array in BigQuery standard SQL?

I am wondering if it is possible to order (apply ORDER BY to) individual array values in Google BigQuery.
I am able to achieve this by applying ORDER BY to the whole transactional base table first and then aggregating the arrays, but when the table is too large, ordering it produces resource errors.
So I am wondering whether each individual array value can be ordered using SQL or a UDF.
This was asked before in "Order of data in bigquery repeated records", but that was 4-5 years ago.

Sure, you can use the ARRAY function. It supports an optional ORDER BY clause. You haven't provided sample data, but supposing that you have a top level array column named arr, you can do something like this:
SELECT
  col1,
  col2,
  ARRAY(SELECT x FROM UNNEST(arr) AS x ORDER BY x) AS arr
FROM MyTable;
This sorts the elements of arr by their values.
If you actually have an array of a struct type, such as ARRAY<STRUCT<a INT64, b STRING>>, you can sort by one of the struct fields:
SELECT
  col1,
  col2,
  ARRAY(SELECT x FROM UNNEST(arr) AS x ORDER BY a) AS arr
FROM MyTable;
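For a concrete illustration, here is the same pattern with hypothetical inline data (the table and values below are made up for the example):

```sql
-- Hypothetical sample data, just to show the sorting pattern
WITH MyTable AS (
  SELECT 1 AS col1, 'x' AS col2, [5, 3, 9] AS arr
)
SELECT
  col1,
  col2,
  ARRAY(SELECT x FROM UNNEST(arr) AS x ORDER BY x) AS arr
FROM MyTable;
-- arr comes back as [3, 5, 9]
```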

If the array is obtained by aggregation with a GROUP BY clause, the query can look something like this:
SELECT
  ARRAY_AGG(DISTINCT col ORDER BY col)
FROM table
GROUP BY group_col
So no subquery with SELECT is required.
Ref: the accepted answer didn't help me; I took help from here instead - https://count.co/sql-resources/bigquery-standard-sql/array_agg
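As a sketch with made-up inline data (table and column names are placeholders):

```sql
-- Hypothetical data: one sorted, de-duplicated array per group
WITH src AS (
  SELECT 'a' AS group_col, 3 AS col UNION ALL
  SELECT 'a', 1 UNION ALL
  SELECT 'a', 3 UNION ALL
  SELECT 'b', 2
)
SELECT group_col, ARRAY_AGG(DISTINCT col ORDER BY col) AS cols
FROM src
GROUP BY group_col;
-- group 'a' yields [1, 3]; group 'b' yields [2]
```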

Related

How to use pypika to generate the following SQL query?

I have a table with an array column, and I want to sum the array elements together, position by position.
For example, if I have two arrays:
[1,2,3] and [2,1,3]
the result array will look like:
[3,3,6]
This can be done with the following query:
SELECT ARRAY (
  SELECT sum(elem)
  FROM tbl t, unnest(t.arr) WITH ORDINALITY x(elem, rn)
  GROUP BY rn
  ORDER BY rn
);
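Separately from the pypika question itself, the element-wise sum the query computes can be sketched in plain Python (this is only to illustrate the logic; it does not generate SQL):

```python
# Element-wise sum of arrays: the same result the SQL query's
# GROUP BY on ordinality produces. zip_longest pads shorter arrays with 0.
from itertools import zip_longest

def elementwise_sum(arrays):
    """Sum arrays position by position."""
    return [sum(col) for col in zip_longest(*arrays, fillvalue=0)]

print(elementwise_sum([[1, 2, 3], [2, 1, 3]]))  # [3, 3, 6]
```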
How can I use pypika to generate this exact query? I was trying to solve the problem using pypika's CustomFunction and AnalyticFunction.
I'm using PostgreSQL 11.8.1

SQL Count distinct number of rows in table in GBQ

I'd like to count the number of distinct rows in a table. I know that I can do that using GROUP BY or by naming all the columns one by one, but I would like to just do:
select count(distinct *) from my_table
Is that possible?
Do SELECT DISTINCT in a derived table (the subquery), then count the number of rows returned.
select count(*) from
(select distinct * from my_table) dt
(Doesn't your table have any primary key?)
You can use to_json_string():
select count(distinct to_json_string(t))
from t;
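The idea behind the TO_JSON_STRING() trick, shown in plain Python for intuition (the rows and column names here are made up):

```python
# Count distinct rows by serializing each row to a canonical string,
# the same idea as count(distinct to_json_string(t)).
import json

rows = [
    {"a": 1, "b": "x"},
    {"a": 1, "b": "x"},  # exact duplicate
    {"a": 2, "b": "y"},
]
distinct_rows = len({json.dumps(r, sort_keys=True) for r in rows})
print(distinct_rows)  # 2
```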
Below are more options for BigQuery Standard SQL:
select count(distinct format('%t', t))
from `project.dataset.table` t
Depending on your use case, an approximate count can be an even more optimal option:
select approx_count_distinct(format('%t', t))
from `project.dataset.table` t
APPROX_COUNT_DISTINCT returns the approximate result for COUNT(DISTINCT expression). The value returned is a statistical estimate, not necessarily the actual value. This function is less accurate than COUNT(DISTINCT expression), but performs better on huge input.
The use of count(distinct *) is not permitted.
Alternatively, you could explicitly name the columns that define uniqueness.

Sorting concatenated strings after grouping in Netezza

I'm using the code on this page to create a concatenated list of strings on a group-by aggregation basis:
https://dwgeek.com/netezza-group_concat-alternative-working-example.html/
I'm trying to get the concatenated string in sorted order so that, for example, for DB1 I'd get data1,data2,data5,data9.
I tried modifying the original code to select from a pre-sorted table, but it doesn't seem to make any difference.
select Col1,
       count(*) as NUM_OF_ROWS,
       trim(trailing ',' from SETNZ..replace(SETNZ..replace(SETNZ..XMLserialize(SETNZ..XMLagg(SETNZ..XMLElement('X', col2))), '<X>', ''), '</X>', ',')) as NZ_CONCAT_STRING
from (select * from tbl_concat_demo order by 1, 2) as A
group by Col1
order by 1;
Is there a way to sort the strings before they get aggregated?
BTW - I'm aware there is a GROUP_CONCAT UDF function for Netezza, but I won't have access to it.
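For clarity, the desired "sort before concatenating" behavior in plain Python terms (the sample values are taken from the question; this is not Netezza code):

```python
# Group rows by the first column, sort each group's strings,
# then join with commas - the output the question is after.
from collections import defaultdict

rows = [("DB1", "data5"), ("DB1", "data1"), ("DB1", "data9"), ("DB1", "data2")]
groups = defaultdict(list)
for col1, col2 in rows:
    groups[col1].append(col2)

concatenated = {k: ",".join(sorted(v)) for k, v in groups.items()}
print(concatenated["DB1"])  # data1,data2,data5,data9
```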
This is notoriously difficult to accomplish in SQL, since sorting is usually done while returning the data, and here you want it applied to the 'input' set.
Try this:
1) Create a pre-sorted temp table:
create temp table X as select * from tbl_concat_demo order by col2 partition by (col1)
2) In your original code above, select from X instead of tbl_concat_demo.
Let me know if it works.

distinct rows from bigquery table with array field

I have a bigquery table containing a field candidate of array type. How can I query distinct rows from this table?
In this case my query should return just the first row.
I think the approach below is the simplest, and it works for any type, length, etc.
#standardSQL
SELECT ANY_VALUE(candidate) candidate
FROM `project.dataset.table`
GROUP BY FORMAT('%T', candidate)
Previously I used TO_JSON_STRING() for this, but I recently realized that FORMAT() fits best for most cases like this.
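With hypothetical inline data, the grouping looks like this (the table and values are placeholders):

```sql
#standardSQL
-- Hypothetical data: two duplicate array rows collapse to one
WITH t AS (
  SELECT ['a', 'b'] AS candidate UNION ALL
  SELECT ['a', 'b'] UNION ALL
  SELECT ['c']
)
SELECT ANY_VALUE(candidate) AS candidate
FROM t
GROUP BY FORMAT('%T', candidate);
-- returns two rows: ["a", "b"] and ["c"]
```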
Something like:
select split(combed, ".") as candidate from (
  select distinct array_to_string(candidate, ".") as combed
  from `dataset.table`
)

How to extract key and value of greatest value in Hive map?

I have in Hive a field which contains a map that looks like this:
{"258":0.10075276284486512,"259":0.00093852142318649,"262":0.015979321337627,"264":0.0020453444772401,"265":0.024689771044731,"268":0.018837925051338,"274":0.011282124863882}
I would like to extract the key (and value, if possible) of the entry with the greatest value in this map, for each row. In this case, the ideal function would look like this:
select max_val(col)
from table
Output:
max_val
"258"
"165"
"204"
Explode the map column, then use a ranking function like rank() to order the values as required and pick the first such row. (This assumes there is a way to identify a row by some column other than the map; id in the query shown below.)
select id, k, v
from (select id, k, v, rank() over (partition by id order by v desc) as rnum
      from tbl
      lateral view explode(mapCol) t as k, v
     ) t
where rnum = 1
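For intuition, the same "key of the greatest value" selection in plain Python, using the sample map from the question (this is not Hive code, just a sketch of the logic):

```python
# Pick the map entry with the greatest value, as the Hive query does
# with rank() over (partition by id order by v desc).
m = {
    "258": 0.10075276284486512,
    "259": 0.00093852142318649,
    "262": 0.015979321337627,
    "264": 0.0020453444772401,
    "265": 0.024689771044731,
    "268": 0.018837925051338,
    "274": 0.011282124863882,
}
key, value = max(m.items(), key=lambda kv: kv[1])
print(key)  # 258
```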