Collapse elements of array of structs in BigQuery - sql

I have an array of structs in BigQuery that looks like:
"categories": [
{
"value": "A",
"question": "Q1",
},
{
"value": "B",
"question": "Q2",
},
{
"value": "C",
"question": "Q3",
}
]
I'd like to collapse the values "A", "B" and "C" into a separate column, and the value for this particular row should be something like "A - B - C".
How can I do this with a query in BigQuery?

Consider the query below:
select id,
  (select string_agg(value, ' - ')
   from t.questions_struct) values
from questions t
If applied to the sample data in your question:
with questions as (
SELECT 1 AS id,
[
STRUCT("A" as value, "Q1" as question),
STRUCT("B" as value, "Q2" as question),
STRUCT("C" as value, "Q3" as question)
] AS questions_struct
)
the output is:
id  values
1   A - B - C

Assuming this is an array of structs, you can also pivot each question's value into its own column (note the question names are case-sensitive, so match 'Q1' rather than 'q1'):
select (select q.value from unnest(ar) q where q.question = 'Q1') as q1,
       (select q.value from unnest(ar) q where q.question = 'Q2') as q2,
       (select q.value from unnest(ar) q where q.question = 'Q3') as q3
from t;

I think it can be done with the following code:
with questions as (
SELECT 1 AS id,
[
STRUCT("A" as value, "Q1" as question),
STRUCT("B" as value, "Q2" as question),
STRUCT("C" as value, "Q3" as question)
] AS questions_struct
), unnested as (
select * from questions, unnest(questions_struct) as questions_struct
) select id, string_agg(value, ' - ') from unnested group by 1
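Outside of SQL, the unnest-then-aggregate idea is easy to sketch in plain Python (an illustration of the pattern only, not BigQuery code; the sample row mirrors the question):

```python
# Each row carries an array of structs; collapsing means joining the
# "value" fields with " - ", exactly what unnest + string_agg does.
rows = [
    {"id": 1, "categories": [
        {"value": "A", "question": "Q1"},
        {"value": "B", "question": "Q2"},
        {"value": "C", "question": "Q3"},
    ]},
]

collapsed = [
    {"id": r["id"],
     "values": " - ".join(c["value"] for c in r["categories"])}
    for r in rows
]
print(collapsed)  # [{'id': 1, 'values': 'A - B - C'}]
```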

Related

How can I get the type of a JSON value in Presto?

I have a JSON string that I am querying with Presto and I want to aggregate by the types of values. To do this I need to get the value type. Specifically, for JSON like:
{
"a": 1,
"b": "a",
"c": true,
"d": [ 1 ],
"e": { "f": "g" },
}
I would like to learn that the value at $.a is an integer, the value at $.b is a string, etc. (The information doesn't need to be nested, so it would be good enough to know that $.d is an array and $.e is an object.)
typeof appears to return only varchar or json, depending on how you extract JSON from a string:
SELECT
typeof(json_extract(j, '$.a')),
typeof(json_extract_scalar(j, '$.a'))
FROM (SELECT '{"a":1,"b":"a","c":true,"d":[1]}' AS j);
gives me:
_col0 _col1
json varchar(32)
How can I determine the JSON value type for one of these fields?
json_extract gets a value in a JSON string as the json type.
json_format takes a json value and converts it back to a JSON string.
I think you can hack these two things together to get the type of a value at a position by examining the formatted string.
CASE
WHEN substr(value, 1, 1) = '{' THEN 'object'
WHEN substr(value, 1, 1) = '[' THEN 'array'
WHEN substr(value, 1, 1) = '"' THEN 'string'
WHEN substr(value, 1, 1) = 'n' THEN 'null'
WHEN substr(value, 1, 1) = 't' THEN 'boolean'
WHEN substr(value, 1, 1) = 'f' THEN 'boolean'
ELSE 'number'
END AS t
(where value is json_format(json_extract(json_string, '$.your_json_path')))
Here's an example:
SELECT obj, value,
CASE
WHEN substr(value, 1, 1) = '{' THEN 'object'
WHEN substr(value, 1, 1) = '[' THEN 'array'
WHEN substr(value, 1, 1) = '"' THEN 'string'
WHEN substr(value, 1, 1) = 'n' THEN 'null'
WHEN substr(value, 1, 1) = 't' THEN 'boolean'
WHEN substr(value, 1, 1) = 'f' THEN 'boolean'
ELSE 'number'
END AS t
FROM (
SELECT obj, json_format(json_extract(obj, '$.a')) AS value
FROM (
VALUES
('{"a":1}'),
('{"a":2.5}'),
('{"a":"abc"}'),
('{"a":true}'),
('{"a":false}'),
('{"a":null}'),
('{"a":[1]}'),
('{"a":{"h":"w"}}')
) AS t (obj)
)
Which produces this result:
obj value t
{"a":1} 1 number
{"a":2.5} 2.5 number
{"a":"abc"} "abc" string
{"a":true} true boolean
{"a":false} false boolean
{"a":null} null null
{"a":[1]} [1] array
{"a":{"h":"w"}} {"h":"w"} object

SQL function not displaying two decimal places although input parameter value is float

I have a function that rounds to the nearest value in SQL as per below. When I pass my value in and run the function manually, it works as expected. However when I use it within a select statement, it removes the decimal places.
E.g. I expect the output to be 9.00 but instead I only see 9.
CREATE FUNCTION [dbo].[fn_PriceLadderCheck]
(@CheckPrice FLOAT,
@Jur VARCHAR(10))
RETURNS FLOAT
AS
BEGIN
DECLARE @ReturnPrice FLOAT
IF (@Jur = 'SE')
BEGIN
SET @ReturnPrice = (SELECT [Swedish Krona ]
FROM tbl_priceladder_swedishkrona
WHERE [Swedish Krona ] = @CheckPrice +
(SELECT MIN(ABS([Swedish Krona ] - @CheckPrice))
FROM tbl_priceladder_swedishkrona)
OR [Swedish Krona ] = @CheckPrice -
(SELECT MIN(ABS([Swedish Krona ] - @CheckPrice))
FROM tbl_priceladder_swedishkrona))
END
IF (@Jur = 'DK')
BEGIN
SET @ReturnPrice = (SELECT [Danish Krone ]
FROM tbl_priceladder_danishkrone
WHERE [Danish Krone ] = @CheckPrice +
(SELECT MIN(ABS([Danish Krone ] - @CheckPrice))
FROM tbl_priceladder_danishkrone)
OR [Danish Krone ] = @CheckPrice -
(SELECT MIN(ABS([Danish Krone ] - @CheckPrice))
FROM tbl_priceladder_danishkrone))
END
RETURN @ReturnPrice
END
Run SQL manually:
declare @checkprice float
set @checkprice = 10.3615384615385
SELECT [Swedish Krona ]
FROM tbl_priceladder_swedishkrona
WHERE [Swedish Krona ] = @checkprice +
( SELECT MIN(ABS([Swedish Krona ] - @checkprice))
FROM tbl_priceladder_swedishkrona
)
OR [Swedish Krona ] = @checkprice -
( SELECT MIN(ABS([Swedish Krona ] - @checkprice))
FROM tbl_priceladder_swedishkrona
)
When I use this function within a SQL select statement, for some reason it drops the decimal places.
SELECT
Article, Colour,
dbo.fn_PriceLadderCheck([New Price], 'se') AS [New Price]
FROM
#temp2 t
[New Price] on its own outputs, for example, 10.3615384615385.
Any ideas?
Cast the result to a DECIMAL and specify the scale. Change the function's return type from FLOAT to DECIMAL(16, 2) and return the cast value:
RETURN CAST(@ReturnPrice AS DECIMAL(16, 2))
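The underlying issue (a float 9 displays as 9, not 9.00) has the same shape in Python, where quantizing a Decimal to two places is the analogue of casting to DECIMAL(16, 2). A small sketch:

```python
from decimal import Decimal, ROUND_HALF_UP

# Fix the scale at two decimal places, like CAST(... AS DECIMAL(16, 2)).
def to_price(x):
    return Decimal(str(x)).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)

print(to_price(9))                 # 9.00
print(to_price(10.3615384615385))  # 10.36
```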

Snowflake SQL Regex ~ Extracting Multiple Vals

I am trying to identify a value that is nested in a string using Snowflake's regexp_substr().
The values that I want to access are in quotes:
...
Type:
a:
- !<string>
val: "A"
- !<string>
val: "B"
- !<string>
val: "C"
...
*There is a lot of text above and below this.
I want to extract A, B, and C for all columns. But I am unsure how. I have tried using regexp_substr() but haven't been able to isolate past the first value.
I have tried:
REGEXP_SUBSTR(col, 'Type\\W+(\\w+)\\W+\\w.+\\W+\\w.+')
which yields:
Type: a: - !<string> val: "A"
and while that gives the first portion of the string with "A", I just want a way to access "A", "B", and "C" individually.
This select statement will give you what you want ... sorta. Notice that it looks for a specific occurrence of "val" and then gives you the next word after it.
A regex, to my knowledge, evaluates to the first occurrence of the expression, so once the pattern is found it's done. You may want to look at Snowflake JavaScript stored procedures to see if you can take the example below and iterate through it, incrementing the occurrence argument to produce the expected output.
SELECT REGEXP_SUBSTR('Type: a:- !<string>val: "A" - !<string> val: "B" - !<string> val: "C"','val\\W+(\\w+)', 1, 1, 'e', 1) as A,
REGEXP_SUBSTR('Type: a:- !<string>val: "A" - !<string> val: "B" - !<string> val: "C"','val\\W+(\\w+)', 1, 2, 'e', 1) as B,
REGEXP_SUBSTR('Type: a:- !<string>val: "A" - !<string> val: "B" - !<string> val: "C"','val\\W+(\\w+)', 1, 3, 'e', 1) as C;
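The occurrence-at-a-time pattern above maps directly onto Python's re.findall, which returns every capture in one call (a sketch of the same regex, not Snowflake code):

```python
import re

# findall returns the capture group for every non-overlapping match,
# i.e. the equivalent of REGEXP_SUBSTR with occurrence 1, 2, 3, ...
s = 'Type: a:- !<string>val: "A" - !<string> val: "B" - !<string> val: "C"'
vals = re.findall(r'val\W+(\w+)', s)
print(vals)  # ['A', 'B', 'C']
```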
You have to extract the values in two stages:
1. Extract the section of the document below Type: a: containing all of the val: "data" entries.
2. Extract the "data" values as an array, or use REGEXP_SUBSTR() with occurrence n to extract the nth element.
SELECT
'Type:\\s+\\w+:((\\s+- !<string>\\s+val:\\s+"[^"]+")+)' type_section_rx,
REGEXP_SUBSTR(col, type_section_rx, 1, 1, 'i', 1) vals,
PARSE_JSON('[0' || REPLACE(vals, REGEXP_SUBSTR(vals, '[^"]+'), ', ') || ']') raw_array,
ARRAY_SLICE(raw_array, 1, ARRAY_SIZE(raw_array)) val_array,
val_array[1] B
FROM INPUT_STRING
The result is an array where you can access the first value with the index [0] etc.
The first regexp can be shortened down to a "least effort" 'Type:\\s+\\w+:(([^"]+"[^"]+")+)'.
One more angle: use JavaScript regex capabilities in a UDF.
For example:
create or replace function my_regexp(S text)
returns array
language javascript
as
$$
const re = /(\w+)/g
return [...S.match(re)]
$$
;
Invoked this way:
set S = '
Type:
a:
- !<string>
val: "A"
- !<string>
val: "B"
- !<string>
val: "C"
';
select my_regexp($S);
Yields:
[ "Type", "a", "string", "val", "A", "string", "val", "B", "string", "val", "C" ]
Implementing your full regex is a little more work but as you can see, this gets around the single value limitation.
That said, if performance is your priority, I would expect Snowflake native regex support to outperform, even though you specify the regex multiple times, though I haven't tested this.

Postgresql how to return select as array

IN POSTGRESQL:
Say I have a table of Users and I do:
SELECT "EyeColor" FROM "Users" WHERE "Age" = 32
And this returns:
[
{"EyeColor": "blue"},
{"EyeColor": "green"},
{"EyeColor": "blue"}
]
Then I want to put blue, green, blue into an array and use that. This is the closest I can get but it is not working:
SELECT * FROM "Eyes" WHERE "Color" IN
(SELECT array_agg("EyeColor") FROM "Users" WHERE "Age" = 32)
I want the above query to work the same as this:
SELECT * FROM "Eyes" WHERE "Color" IN ('blue', 'green')
You do not need to aggregate the subquery result into an array. You can use IN (subquery):
SELECT *
FROM "Eyes"
WHERE "Color" IN (
SELECT "EyeColor"
FROM "Users"
WHERE "Age" = 32)
or ANY (subquery):
SELECT *
FROM "Eyes"
WHERE "Color" = ANY(
SELECT "EyeColor"
FROM "Users"
WHERE "Age" = 32)

Pandas Groupby Sum to Columns

I am doing a groupby and can get the sum for a column OK, but how do I get the sum of two columns together?
detail['debit'] = df.groupby('type')['debit'].sum()
detail['credit'] = df.groupby('type')['credit'].sum()
Now I need (credit - debit) as well, something like this:
detail['profit'] = df.groupby('type')(['credit'] - ['debit']).sum()
obviously that does not work.
Thanks.
Like @IanS suggested, I would first save the result in a new column and apply the groupby afterwards:
df['profit'] = df['credit'] - df['debit']
detail = df.groupby('type').sum()[['profit', 'credit', 'debit']]
I also combined the groupby actions into one.
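A tiny runnable version of the save-a-column-then-groupby approach, on made-up data (the column and group names mirror the question):

```python
import pandas as pd

# Compute profit per row first, then group once and sum all three
# columns together.
df = pd.DataFrame({
    "type":   ["x", "x", "y"],
    "credit": [10, 20, 5],
    "debit":  [3, 4, 1],
})
df["profit"] = df["credit"] - df["debit"]
detail = df.groupby("type")[["profit", "credit", "debit"]].sum()
# type x: profit 23, credit 30, debit 7; type y: profit 4, credit 5, debit 1
print(detail)
```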
Have you tried:
detail['profit'] = sum(df.groupby('type')['credit'] - df.groupby('type')['debit'])