SparkSQL regexp_extract function java error - sql

I am trying to extract the IDs starting with srsa from the table structure below:
id reason_text_field
34394 {"initial_customer":"sda_WWyfr4AXY1fIAS", "customer_result":"srsa_CAkAaAvNKL2OSD"}
in order to get the following output:
id srsa_id
34394 srsa_CAkAaAvNKL2OSD
but when I use the following SparkSQL function
REGEXP_EXTRACT(reason_text_field, 'srsa[^"]*') as srsa_id
I get this error:
java.lang.IndexOutOfBoundsException: No group

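You need to specify the group to capture. REGEXP_EXTRACT defaults to group index 1, and your pattern contains no capture group, hence the "No group" error. Try this: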
SELECT id,
REGEXP_EXTRACT(reason_text_field, '\"(srsa[^"]*)\"', 1) as srsa_id
-- or REGEXP_EXTRACT(reason_text_field, 'srsa[^"]*', 0) as srsa_id
FROM tb
Note however that you can also convert the text column reason_text_field into a map or struct using from_json then extract the field customer_result:
SELECT id,
from_json(reason_text_field, 'map<string,string>')['customer_result'] as srsa_id
FROM tb
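To see why the group index matters, here is a minimal sketch in Python rather than Spark. It assumes Python's re module behaves closely enough to Java's regex engine for this pattern, and the sample string is modeled on the question's row, written as valid JSON. Group 0 is the whole match, group 1 exists only when the pattern has parentheses, and parsing the JSON avoids the regex altogether.

```python
import json
import re

# Sample value modeled on the question's row, written as valid JSON.
reason_text_field = (
    '{"initial_customer":"sda_WWyfr4AXY1fIAS", '
    '"customer_result":"srsa_CAkAaAvNKL2OSD"}'
)

# No capture group in the pattern: only group 0 (the whole match) exists.
whole = re.search(r'srsa[^"]*', reason_text_field)
srsa_from_group0 = whole.group(0)

# Asking for group 1 on a pattern without groups fails,
# which is comparable to Spark's default idx = 1 on the original pattern.
try:
    whole.group(1)
    group1_error = None
except IndexError as exc:
    group1_error = str(exc)

# With a capture group, group 1 is the text between the quotes.
captured = re.search(r'"(srsa[^"]*)"', reason_text_field)
srsa_from_group1 = captured.group(1)

# Parsing the JSON and reading the key, the counterpart of the from_json approach.
srsa_from_json = json.loads(reason_text_field)["customer_result"]

print(srsa_from_group0, srsa_from_group1, srsa_from_json)
```

All three approaches return the same value here. The JSON route is the most robust if other fields could also start with srsa.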

Related

JSON and Teradata

I have the following JSON:
'{"0": false,"1": false,"barring": "BAR_ROAMING"}'
Teradata supports dot notation on JSON columns, which can be used to extract the barring value: F_JSON.barring --> BAR_ROAMING
But for the other 2, which are dynamic keys, how can I extract them?
You can use the JSONExtractValue function:
select JsonCol.JSONExtractValue('$.[0]') as FirstOne
, JsonCol.JSONExtractValue('$.[1]') as SecondOne
from (
select new json('{"0": false,"1": false,"barring": "BAR_ROAMING"}')
) MyJsonData(JsonCol)
https://docs.teradata.com/r/HN9cf0JB0JlWCXaQm6KDvw/aaGwlJOTKsXk4IaU7vsE6g
I ended up using
CREATE TABLE KEY_JSON AS (
SELECT DISTINCT(JSONKeys) J_KEY FROM Json_Keys
(
ON (SELECT JSON FROM JSON_TABLE) USING QUOTES('N'))
AS json_data) WITH DATA;
I then performed a JOIN between my two tables (JSON_TABLE and KEY_JSON) ON JSON LIKE '%||J_KEY||%'
and extracted the value using JSONEXTRACT(JSON.'$."||J_KEY)
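For comparison, the same "discover the keys first, then look up each value" idea can be sketched outside the database. This is plain Python, not Teradata, and it assumes the JSON document fits in memory. The only names taken from the question are the sample document and the barring key.

```python
import json

# The sample document from the question.
doc = json.loads('{"0": false,"1": false,"barring": "BAR_ROAMING"}')

# Keys are discovered at run time, so "0" and "1" are never hard-coded.
dynamic = {key: value for key, value in doc.items() if key != "barring"}

print(dynamic)         # {'0': False, '1': False}
print(doc["barring"])  # BAR_ROAMING
```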

PG::InvalidParameterValue: ERROR: cannot extract element from a scalar

I'm fetching data from the JSON column by using the following query.
SELECT id FROM ( SELECT id,JSON_ARRAY_ELEMENTS(shipment_lot::json) AS js2 FROM file_table WHERE ('shipment_lot') = ('shipment_lot') ) q WHERE js2->> 'invoice_number' LIKE ('%" abc1123"%')
My PostgreSQL version is 9.3.
Saved data in JSON column:
[{ "id"=>2981, "lot_number"=>1, "activate"=>true, "invoice_number"=>"abc1123", "price"=>378.0}]
However, I'm getting this error:
ActiveRecord::StatementInvalid (PG::InvalidParameterValue: ERROR: cannot extract element from a scalar:
SELECT id FROM
( SELECT id,JSON_ARRAY_ELEMENTS(shipment_lot::json)
AS js2 FROM file_heaps
WHERE ('shipment_lot') = ('shipment_lot') ) q
WHERE js2->> 'invoice_number' LIKE ('%abc1123%'))
How can I solve this issue?
Your issue is that the stored data is not valid JSON: it uses Ruby hash syntax (=>) instead of colons.
If you try casting your example data to json in Postgres, it fails:
SELECT ('[{ "id"=>2981, "lot_number"=>1, "activate"=>true, "invoice_number"=>"abc1123", "price"=>378.0}]')::json
This is the JSON formatted correctly:
SELECT ('[{ "id":2981, "lot_number":1, "activate":true, "invoice_number":"abc1123", "price":378.0}]')::json
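If the rows were already written in the hash-rocket form, one possible repair is sketched below in Python. It is only a sketch under a strong assumption: the sequence => never appears inside a string value, since a plain replace would corrupt such values. The helper name is_valid_json is made up for this example.

```python
import json

# The stored value from the question, in Ruby hash syntax.
stored = (
    '[{ "id"=>2981, "lot_number"=>1, "activate"=>true, '
    '"invoice_number"=>"abc1123", "price"=>378.0}]'
)

def is_valid_json(text):
    """Return True if text parses as JSON (hypothetical helper)."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

# Naive repair: assumes "=>" never occurs inside a string value.
repaired = stored.replace("=>", ":")

rows = json.loads(repaired)
matching_ids = [row["id"] for row in rows if "abc1123" in row["invoice_number"]]

print(is_valid_json(stored), is_valid_json(repaired), matching_ids)
```

The longer-term fix is to serialize with a JSON encoder before saving (for example to_json in Ruby) instead of storing the hash's inspect output.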

Using Array(Tuple(LowCardinality(String), Int32)) in ClickHouse

I have a table
CREATE TABLE table (
id Int32,
values Array(Tuple(LowCardinality(String), Int32)),
date Date
) ENGINE MergeTree()
PARTITION BY toYYYYMM(date)
ORDER BY (id, date)
but when executing the request
SELECT count(*)
FROM table
WHERE (arrayExists(x -> ((x.1) = toLowCardinality('pattern')), values) = 1)
I get an error
Code: 49. DB::Exception: Received from clickhouse:9000. DB::Exception: Cannot capture column 3 because it has incompatible type: got String, but LowCardinality(String) is expected..
If I replace the column 'values'
values Array(Tuple(String, Int32))
then the request is executed without errors.
What could be the problem when using Array(Tuple(LowCardinality(String), Int32))?
Until it is fixed (see bug 7815), you can use this workaround:
SELECT uniqExact((id, date)) AS count
FROM table
ARRAY JOIN values
WHERE values.1 = 'pattern'
When there is more than one Array column, you can do it this way:
SELECT uniqExact((id, date)) AS count
FROM
(
SELECT
id,
date,
arrayJoin(values) AS v,
arrayJoin(values2) AS v2
FROM table
WHERE v.1 = 'pattern' AND v2.1 = 'pattern2'
)
values Array(Tuple(LowCardinality(String), Int32)),
Do not use Tuple here. It brings only downsides.
It is still 2x the files on disk.
It gives a twofold slowdown when you extract only one tuple element:
https://gist.github.com/den-crane/f20a2dce94a2926a1e7cfec7cdd12f6d
Use two separate arrays instead:
valuesS Array(LowCardinality(String)),
valuesI Array(Int32)
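To illustrate the two-array layout, here is a small Python sketch with made-up data. Only the column names valuesS and valuesI come from the answer; the row contents are hypothetical. The point is that a filter on the string part touches only one array, while the pairs stay recoverable by position.

```python
# One hypothetical row in the two-array layout (equal-length arrays).
row = {
    "id": 1,
    "valuesS": ["pattern", "other"],
    "valuesI": [10, 20],
}

# Counterpart of arrayExists(x -> x = 'pattern', valuesS):
# only the string array is read.
has_pattern = any(s == "pattern" for s in row["valuesS"])

# When both parts are needed, pair them up by position.
pairs = list(zip(row["valuesS"], row["valuesI"]))

print(has_pattern, pairs)
```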

Cannot have map type columns in DataFrame which calls set operations

org.apache.spark.sql.AnalysisException: Cannot have map type columns in DataFrame which calls set operations(intersect, except, etc.), but the type of column map_col is map
I have a hive table with a column of type - MAP<Float, Float>. I get the above error when I try to do an insertion on this table in a spark context. Insertion works fine without the 'distinct'.
create table test_insert2(`test_col` string, `map_col` MAP<INT,INT>)
location 's3://mybucket/test_insert2';
insert into test_insert2
select distinct 'a' as test_col, map(0,0) as map_col
Try converting the DataFrame to an RDD with .rdd, then apply the .distinct function.
Example:
spark.sql("""select 'a' test_col, map(0,0) map_col
union all
select 'a' test_col, map(0,0) map_col""").rdd.distinct.collect
Result:
Array[org.apache.spark.sql.Row] = Array([a,Map(0 -> 0)])
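As a loose analogy only (this is plain Python, not Spark, and says nothing about how Spark implements distinct), the same kind of problem shows up when deduplicating rows that hold a dict: the dict cannot be compared through hashing directly, so each row is first turned into a canonical, hashable form. The function name freeze and the sample rows are invented for this sketch.

```python
# Hypothetical rows: (test_col, map_col).
rows = [("a", {0: 0}), ("a", {0: 0}), ("b", {1: 2, 0: 3})]

# A dict is unhashable, so building a set of these rows directly fails.
try:
    set(rows)
    direct_set_works = True
except TypeError:
    direct_set_works = False

def freeze(row):
    """Turn a row into a hashable key by sorting the map's entries."""
    test_col, map_col = row
    return (test_col, tuple(sorted(map_col.items())))

# Keep the first occurrence of each distinct row, preserving order.
seen = set()
distinct_rows = []
for row in rows:
    key = freeze(row)
    if key not in seen:
        seen.add(key)
        distinct_rows.append(row)

print(direct_set_works, distinct_rows)
```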

Big Query Regexp_Extract using Google Analytics url

How do I extract the id parameter using BigQuery REGEXP_EXTRACT? I have some rows with page URLs in them that look similar to:
url.com/id=userIDmadeUPofletterandnumbers&em=MemberType
e.g. url.com/id=asd1221231sf&em=studentMember
I have tried using:
a. REGEXP_EXTRACT(urlValue,"id=\w+") as Idvalue but I get the error message:
Invalid string literal: "id=\w+"
I am pretty close with this: REGEXP_EXTRACT(urlValue,"(id=.*&em)"). However, it returns id=asd1221231sf&em, and I want to exclude the id= at the start and the &em at the end.
#standardSQL
WITH `project.dataset.table` AS (
SELECT 'url.com/id=userIDmadeUPofletterandnumbers&em=MemberType' urlValue UNION ALL
SELECT 'url.com/id=asd1221231sf&em=studentMember'
)
SELECT REGEXP_EXTRACT(urlValue, r'id=(\w+)') id, urlValue
FROM `project.dataset.table`
Row  id                              urlValue
1    userIDmadeUPofletterandnumbers  url.com/id=userIDmadeUPofletterandnumbers&em=MemberType
2    asd1221231sf                    url.com/id=asd1221231sf&em=studentMember
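Two things make the query above work: the raw string literal r'...' lets the backslash in \w reach the regex engine (in a regular BigQuery string literal, \w is an invalid escape sequence, which explains the "Invalid string literal" error), and the parentheses make REGEXP_EXTRACT return only the captured part. A rough Python equivalent, assuming Python's re treats this simple pattern the same way BigQuery's RE2 does:

```python
import re

# The two sample URLs from the question.
urls = [
    "url.com/id=userIDmadeUPofletterandnumbers&em=MemberType",
    "url.com/id=asd1221231sf&em=studentMember",
]

# Raw string so the backslash in \w is passed through to the regex engine.
pattern = re.compile(r"id=(\w+)")

# group(1) is only the captured id; \w+ stops at "&" because it is not a word character.
ids = [pattern.search(url).group(1) for url in urls]

print(ids)
```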