How can I get the type of a JSON value in Presto?

I have a JSON string that I am querying with Presto and I want to aggregate by the types of values. To do this I need to get the value type. Specifically, for JSON like:
{
  "a": 1,
  "b": "a",
  "c": true,
  "d": [ 1 ],
  "e": { "f": "g" }
}
I would like to determine that the value at $.a is an integer, the value at $.b is a string, etc. (The information doesn't need to be nested, so it would be good enough to know that $.d is an array and $.e is an object.)
typeof appears to return only varchar or json, depending on how you extract JSON from a string:
SELECT
  typeof(json_extract(j, '$.a')),
  typeof(json_extract_scalar(j, '$.a'))
FROM (SELECT '{"a":1,"b":"a","c":true,"d":[1]}' AS j);
gives me:
_col0  _col1
json   varchar(32)
How can I determine the JSON value type for one of these fields?

json_extract gets a value out of a JSON string as the json type.
json_format takes a json value and converts it back to a JSON string.
You can hack these two together to get the type of the value at a path by examining the first character of the formatted string:
CASE
  WHEN substr(value, 1, 1) = '{' THEN 'object'
  WHEN substr(value, 1, 1) = '[' THEN 'array'
  WHEN substr(value, 1, 1) = '"' THEN 'string'
  WHEN substr(value, 1, 1) = 'n' THEN 'null'
  WHEN substr(value, 1, 1) = 't' THEN 'boolean'
  WHEN substr(value, 1, 1) = 'f' THEN 'boolean'
  ELSE 'number'
END AS t
(where value is json_format(json_extract(json_string, '$.your_json_path')))
Here's an example:
SELECT obj, value,
  CASE
    WHEN substr(value, 1, 1) = '{' THEN 'object'
    WHEN substr(value, 1, 1) = '[' THEN 'array'
    WHEN substr(value, 1, 1) = '"' THEN 'string'
    WHEN substr(value, 1, 1) = 'n' THEN 'null'
    WHEN substr(value, 1, 1) = 't' THEN 'boolean'
    WHEN substr(value, 1, 1) = 'f' THEN 'boolean'
    ELSE 'number'
  END AS t
FROM (
  SELECT obj, json_format(json_extract(obj, '$.a')) AS value
  FROM (
    VALUES
      ('{"a":1}'),
      ('{"a":2.5}'),
      ('{"a":"abc"}'),
      ('{"a":true}'),
      ('{"a":false}'),
      ('{"a":null}'),
      ('{"a":[1]}'),
      ('{"a":{"h":"w"}}')
  ) AS t (obj)
)
Which produces this result:
obj              value      t
{"a":1}          1          number
{"a":2.5}        2.5        number
{"a":"abc"}      "abc"      string
{"a":true}       true       boolean
{"a":false}      false      boolean
{"a":null}       null       null
{"a":[1]}        [1]        array
{"a":{"h":"w"}}  {"h":"w"}  object
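Since the original goal was to aggregate by value type, you can wrap that CASE in a subquery and GROUP BY it. A sketch (table name hypothetical); note that json_extract returns NULL when the key is missing, which the bare CASE above would misreport as 'number', hence the extra IS NULL guard:
SELECT t, count(*) AS n
FROM (
  SELECT
    CASE
      WHEN value IS NULL THEN 'missing'  -- key absent: json_extract returned NULL
      WHEN substr(value, 1, 1) = '{' THEN 'object'
      WHEN substr(value, 1, 1) = '[' THEN 'array'
      WHEN substr(value, 1, 1) = '"' THEN 'string'
      WHEN substr(value, 1, 1) = 'n' THEN 'null'
      WHEN substr(value, 1, 1) IN ('t', 'f') THEN 'boolean'
      ELSE 'number'
    END AS t
  FROM (
    SELECT json_format(json_extract(obj, '$.a')) AS value
    FROM my_table  -- hypothetical table with a JSON string column obj
  )
)
GROUP BY t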


CrateDB: loop over a column array and see if the values fall between X and Y

I have a column called itemPrices that is an array of integers:
[43, 44, 55]
I have an API that gives me two numbers, X and Y. I want to compare those two numbers against the array: if a number in the array falls between X and Y, I would like to retrieve the contents. How would I do such a thing in CrateDB?
This can also be solved by using the array_min(array) and array_max(array) scalar functions (note this matches rows where every element lies between the bounds):
cr> CREATE TABLE t1 (arr ARRAY(INTEGER));
CREATE OK, 1 row affected (1.918 sec)
cr> INSERT INTO t1 (arr) VALUES ([43, 44, 45]), ([42, 22, 105]);
INSERT OK, 2 rows affected (0.112 sec)
cr> SELECT arr FROM t1 WHERE array_min(arr) >= 43 AND array_max(arr) <= 45;
+--------------+
| arr |
+--------------+
| [43, 44, 45] |
+--------------+
SELECT 1 row in set (0.008 sec)
CREATE OR REPLACE FUNCTION filter_item_price(ARRAY(REAL), INTEGER, INTEGER) RETURNS BOOLEAN
LANGUAGE JAVASCRIPT
AS 'function filter_item_price(array_integer, min_value, max_value) {
  if (array_integer === null || array_integer === undefined) {
    return false;
  }
  return array_integer.some(element => element >= min_value && element <= max_value);
}';
SELECT "itemPrices"
FROM scp_service_transaction.transactions_v2
WHERE "tenantId" = 'aptos-denim'
AND "retailLocationId" IN (161)
AND filter_item_price("itemPrices", 98, 100) LIMIT 100;
This can be solved with a User-Defined Function.
If you want to find the elements in the array, you could use a function like this:
CREATE OR REPLACE FUNCTION array_filter(ARRAY(INTEGER), INTEGER, INTEGER) RETURNS ARRAY(INTEGER)
LANGUAGE JAVASCRIPT
AS 'function array_filter(array_integer, min_value, max_value) {
  return Array.prototype.filter.call(array_integer, element => element >= min_value && element <= max_value);
}';
SELECT array_filter([5, 7, 20], 2, 8);
-- returns [5, 7]
If you only want to identify if there is a value within the given boundaries, you can also do this:
CREATE OR REPLACE FUNCTION array_find(ARRAY(INTEGER), INTEGER, INTEGER) RETURNS BOOLEAN
LANGUAGE JAVASCRIPT
AS 'function array_find(array_integer, min_value, max_value) {
  return Array.prototype.find.call(array_integer, element => element >= min_value && element <= max_value) !== undefined;
}';
SELECT array_find([5, 7, 20], 5, 300);
-- returns true
SELECT array_find([5, 7, 20], 25, 300);
-- returns false
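Putting it together with the column from the question (table name hypothetical), the boolean UDF slots straight into a WHERE clause:
SELECT "itemPrices"
FROM items                              -- hypothetical table holding the itemPrices column
WHERE array_find("itemPrices", 43, 55); -- X = 43, Y = 55 from the API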

How to compare two json objects using Karate where order of element is to be retained?

I need to compare 2 JSON objects where the order is retained while comparing. Since Karate's match ignores the order of elements, I am curious to know whether there is a way to do this in Karate.
Not directly; it is normally not needed, since JSON keys can be in any order, like a Map.
But you can do an exact match after converting to a (normalized) string:
* def foo = { a: 1, b: 2 }
* string str1 = foo
* string str2 = { "a": 1, "b": 2 }
* assert str1 == str2
You can also get an ordered list of keys / values at any time:
* def vals = karate.valuesOf(foo)
* match vals == [1, 2]
* def keys = karate.keysOf(foo)
* match keys == ['a', 'b']
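For contrast, a plain match on the same foo passes even when the keys are written in a different order, since key order is ignored (a minimal sketch):
* def bar = { b: 2, a: 1 }
* match foo == bar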

Snowflake SQL Regex ~ Extracting Multiple Vals

I am trying to identify a value that is nested in a string using Snowflake's REGEXP_SUBSTR().
The values that I want to access are in quotes:
...
Type:
a:
- !<string>
val: "A"
- !<string>
val: "B"
- !<string>
val: "C"
...
(There is a lot of text above and below this.)
I want to extract A, B, and C for all columns, but I am unsure how. I have tried using REGEXP_SUBSTR() but haven't been able to isolate past the first value.
I have tried:
REGEXP_SUBSTR(col, 'Type\\W+(\\w+)\\W+\\w.+\\W+\\w.+')
which yields:
Type: a: - !<string> val: "A"
and while that gives the first portion of the string with "A", I just want a way to access "A", "B", and "C" individually.
This select statement will give you what you want ... sorta. Notice that it looks for a specific occurrence of "val" and then gives you the run of word characters after it.
Regex, to my knowledge, evaluates to the first occurrence of the expression, so once the pattern is found it's done. You may want to look at Snowflake JavaScript stored procedures to see if you can take the example below and iterate through, incrementing the occurrence parameter, to produce the expected output.
SELECT REGEXP_SUBSTR('Type: a:- !<string>val: "A" - !<string> val: "B" - !<string> val: "C"','val\\W+(\\w+)', 1, 1, 'e', 1) as A,
REGEXP_SUBSTR('Type: a:- !<string>val: "A" - !<string> val: "B" - !<string> val: "C"','val\\W+(\\w+)', 1, 2, 'e', 1) as B,
REGEXP_SUBSTR('Type: a:- !<string>val: "A" - !<string> val: "B" - !<string> val: "C"','val\\W+(\\w+)', 1, 3, 'e', 1) as C;
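Following that idea, here is a sketch of such a stored procedure (name hypothetical, untested): it calls REGEXP_SUBSTR with an increasing occurrence number until no further match is found.
create or replace procedure extract_all_vals(str string)
returns variant
language javascript
as
$$
  // Keep asking REGEXP_SUBSTR for the next occurrence until it returns NULL.
  var out = [];
  for (var i = 1; ; i++) {
    var rs = snowflake.execute({
      sqlText: "select regexp_substr(?, 'val\\\\W+(\\\\w+)', 1, ?, 'e', 1)",
      binds: [STR, i]
    });
    rs.next();
    var v = rs.getColumnValue(1);
    if (v === null) break;
    out.push(v);
  }
  return out;
$$
;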
You have to extract the values in two stages:
1. Extract the section of the document below Type: a: containing all the val: "data".
2. Extract the "data" as an array, or use REGEXP_SUBSTR() + index n to extract the nth element.
SELECT
  'Type:\\s+\\w+:((\\s+- !<string>\\s+val:\\s+"[^"]")+)' type_section_rx,
  REGEXP_SUBSTR(col, type_section_rx, 1, 1, 'i', 1) vals,
  PARSE_JSON('[0' || REPLACE(vals, REGEXP_SUBSTR(vals, '[^"]+'), ', ') || ']') raw_array,
  ARRAY_SLICE(raw_array, 1, ARRAY_SIZE(raw_array)) val_array,
  val_array[1] B
FROM INPUT_STRING
The result is an array where you can access the first value with the index [0] etc.
The first regexp can be shortened down to a "least effort" 'Type:\\s+\\w+:(([^"]+"[^"]+")+)'.
One more angle: use JavaScript regex capabilities in a UDF.
For example:
create or replace function my_regexp(S text)
returns array
language javascript
as
$$
  // The /g flag returns every match; without it, match() stops at the first one.
  const re = /(\w+)/g
  return [...S.match(re)]
$$
;
Invoked this way:
set S = '
Type:
a:
- !<string>
val: "A"
- !<string>
val: "B"
- !<string>
val: "C"
';
select my_regexp($S);
Yields:
[ "Type", "a", "string", "val", "A", "string", "val", "B", "string", "val", "C" ]
Implementing your full regex is a little more work, but as you can see, this gets around the single-value limitation.
That said, if performance is a priority, I would expect Snowflake's native regex support to outperform a UDF, even though you have to spell out the regex multiple times; I haven't tested this.
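For instance, a sketch of that fuller version (function name hypothetical): an exec loop that keeps only the capture group inside the quotes, so only the values A, B, and C come back.
create or replace function extract_quoted_vals(S text)
returns array
language javascript
as
$$
  // Capture only the text between quotes after each val:
  var re = /val:\s*"([^"]*)"/g;
  var out = [];
  var m;
  while ((m = re.exec(S)) !== null) {
    out.push(m[1]); // m[1] is the capture group
  }
  return out;
$$
;
select extract_quoted_vals($S);
-- expected: [ "A", "B", "C" ]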

PySpark dataframe join on customized join condition

I have df_a, df_b, I want to join them on a customized condition: coalesce(df_a.id, 0) + 1 == df_b.id, how should I write the code?
I tried df_joined = df_a.join(df_b, coalesce(df_a.id, 0) + 1 == df_b.id)
But got error: Invalid argument, not a string or column: 0 of type <type 'int'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function
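The error message itself points at the fix: PySpark's coalesce() only accepts columns, so the literal 0 needs to be wrapped in lit(). A minimal sketch:
from pyspark.sql.functions import coalesce, lit

# coalesce(id, 0) + 1 on the left side, joined against df_b.id
df_joined = df_a.join(df_b, coalesce(df_a["id"], lit(0)) + 1 == df_b["id"])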

Applying aggregate functions to multiple properties with LINQ GroupBy

I have a list of objects (called sourceList).
Each object contains: Id, Num1, Num2, Num3, Name, Lname
Assume I have the following list:
1, 1, 5, 9, 'a', 'b'
1, 2, 3, 2, 'b', 'm'
2, 5, 8, 7, 'r', 'a'
How can I return another list (of Object2) containing: Id, sum of Num1, sum of Num2?
For the example above, it should return a list of Object2 that contains:
1, 3, 8
2, 5, 8
I tried:
Dim a = sourceList.GroupBy(Function(item) item.Id).
Select(Function(x) x.Sum(Function(y) y.Num1)).ToList()
But I don't know how to sum num2.
What you need is to project each group into an anonymous type with an object initializer, which is what LINQ query syntax does for you implicitly. So e.g.
Dim a = sourceList.GroupBy(Function(item) item.Id).
Select(Function(x) New With {.Id = x.Key,
.Sum1 = x.Sum(Function(y) y.Num1),
.Sum2 = x.Sum(Function(y) y.Num2),
.Sum3 = x.Sum(Function(y) y.Num3)}).ToList()
Tested this and it seems to work fine. (The equivalent query syntax is
Dim fluent = (From item In sourceList
Group By item.Id Into Sum1 = Sum(item.Num1),
Sum2 = Sum(item.Num2),
Sum3 = Sum(item.Num3)).ToList()
which also works great and is arguably a bit easier to read.)
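If you need the result typed as your Object2 rather than an anonymous type, the same projection works with a named type (assuming Object2 has a parameterless constructor and settable Id, Sum1, and Sum2 properties):
Dim typed = sourceList.GroupBy(Function(item) item.Id).
    Select(Function(g) New Object2 With {
        .Id = g.Key,
        .Sum1 = g.Sum(Function(y) y.Num1),
        .Sum2 = g.Sum(Function(y) y.Num2)
    }).ToList()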