extract json in column - sql

I have a table
id | status | outgoing
-------------------------
1 | paid | {"a945248027_14454878":"processing","old.a945248027_14454878":"cancelled"}
2 | pending| {"069e5248cf_45299995":"processing"}
I am trying to extract the values after each underscore in the outgoing column e.g from a945248027_14454878 I want 14454878
Because the json data is not standardised I can't seem to figure it out.

You may extract the json key part after the underscore using regexp version of substring.
select id, status, outgoing,
substring(key from '_([^_]+)$') as key
from the_table, lateral jsonb_object_keys(outgoing) as j(key);
See demo.

Related

Extract Values based on Keys in a Bigquery column

I have data in the form of key value pair (Not Json) as shown below
id | Attributes
---|---------------------------------------------------
12 | Country:US, Eligibility:Yes, startDate:2022-08-04
33 | Country:CA, Eligibility:Yes, startDate:2021-12-01
11 | Country:IN, Eligibility:No, startDate:2019-11-07
I would like to extract only startDate from Attributes section
Expected Output:
id | Attributes_startDate
---|----------------------
12 | 2022-08-04
33 | 2021-12-01
11 | 2019-11-07
One way that I tried was, I tired converting the Attributes column in the Input data into JSON by appending {, } at start and end positions respectively. Also some how tried adding double quotes on the Key values and tried extracting startDate. But, is there any other effective solution to extract startDate as I don't want to rely on Regex.
Is there any way to just specify the key and extract its respective value? (Just like the way we can extract values using keys in a JSON column using JSON_QUERY)
you can try below
create temp function fakejson_extract(json string, attribute string) as ((
select split(kv, ':')[safe_offset(1)]
from unnest(split(json)) kv
where trim(split(kv, ':')[offset(0)]) = attribute
));
select id,
fakejson_extract(Attributes, 'Country') as Country,
fakejson_extract(Attributes, 'Eligibility') as Eligibility,
fakejson_extract(Attributes, 'startDate') as startDate
from your_table
if applied to sample data in your question - output is
is there any other effective solution to extract startDate as I don't want to rely on Regex.
If your feeling about Regex here really very strong - use below
select id, split(Attribute, ':')[safe_offset(1)] Attributes_startDate
from your_table, unnest(split(Attributes)) Attribute
where trim(split(Attribute, ':')[offset(0)]) = 'startDate'
Use below (I think using RegEx here is the most efficient option)
select id, regexp_extract(Attributes, r'startDate:(\d{4}-\d{2}-\d{2})') Attributes_startDate
from your_table
if applied to sample data in your question - output is

Incremental integer ID in Impala

I am using Impala for querying parquet-tables and cannot find a solution to increment an integer-column ranging from 1..n. The column is supposed to be used as ID-reference. Currently I am aware of the uuid() function, which
Returns a universal unique identifier, a 128-bit value encoded as a string with groups of hexadecimal digits separated by dashes.
Anyhow, this is not suitable for me since I have to pass the ID to another system which requests an ID in style of 1..n. I also already know that Impala has no auto-increment-implementation.
The desired result should look like:
-- UUID() provided as example - I want to achieve the `my_id`-column.
| my_id | example_uuid | some_content |
|-------|--------------|--------------|
| 1 | 50d53ca4-b...| "a" |
| 2 | 6ba8dd54-1...| "b" |
| 3 | 515362df-f...| "c" |
| 4 | a52db5e9-e...| "d" |
|-------|--------------|--------------|
How can I achieve the desired result (integer-ID ranging from 1..n)?
Note: This question differs from this one which specifically handles Kudu-tables. However, answers should be applicable for this question as well.
Since other Q&A's like this one only came up with uuid()-alike answers, I put some thought in it and finally came up with this solution:
SELECT
row_number() OVER (PARTITION BY "dummy" ORDER BY "dummy") as my_id
, some_content
FROM some_table
row_number() generates a continuous integer-number over a provided partition. Unlike rank(), row_number() always provides an incremented number on its partition (even if duplicates occur)
PARTITION BY "dummy" partitions the entire table into one partition. This works since "dummy" is interpreted in the execution graph as temporary column yielding only the String-value "dummy". Thus, also something analog to "dummy" works.
ORDER BY is required in order to generate the increment. Since we don't care about the order in this example (otherwise just set your respective column), also use the "dummy"-workaround.
The command creates the desired incremental ID without any nested SQL-statements or other tricks.
| my_id | some_content |
|-------|--------------|
| 1 | "a" |
| 2 | "b" |
| 3 | "c" |
| 4 | "d" |
|-------|--------------|
I used Markus's answer for a large partitioned table and found that I was getting duplicate ids. I think the ids were only unique within their partition; possibly PARTITION BY "dummy" leads Impala to think that each partition can execute row_number() on its own. I was able to get it working by specifying an actual column to order by and no partition by:
SELECT
row_number() OVER (ORDER BY actual_column) as my_id
, some_content
FROM some_table
It doesn't seem to matter whether the values in the column are unique (mine weren't), but using the actual partition key might result in the same issue as the "dummy" column.
Understandably, it took a lot longer to run than the dummy version.

Combine query to get all the matching search text in right order

I have the following table:
postgres=# \d so_rum;
Table "public.so_rum"
Column | Type | Collation | Nullable | Default
-----------+-------------------------+-----------+----------+---------
id | integer | | |
title | character varying(1000) | | |
posts | text | | |
body | tsvector | | |
parent_id | integer | | |
Indexes:
"so_rum_body_idx" rum (body)
I wanted to do phrase search query, so I came up with the below query, for example:
select id from so_rum
where body ## phraseto_tsquery('english','Is it possible to toggle the visibility');
This gives me the results, which only match's the entire text. However, there are documents, where the distance between lexmes are more and the above query doesn't gives me back those data. For example: 'it is something possible to do toggle between the. . . visibility' doesn't get returned. I know I can get it returned with <2> (for example) distance operator by giving in the to_tsquery, manually.
But I wanted to understand, how to do this in my sql statement itself, so that I get the results first with distance of 1 and then 2 and so on (may be till 6-7). Finally append results with the actual count of the search words like the following query:
select count(id) from so_rum
where body ## to_tsquery('english','string & string . . . ')
Is it possible to do in a single query with good performance?
I don't see a canned solution to this. It sounds like you need to use plainto_tsquery to get all the results with all the lexemes, and then implement your own custom ranking function to rank them by distance between the lexemes, and maybe filter out ones with the wrong order.

postgresql: automated extracting strings from text

I've the following table in a postgresl database
id | species
----+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 |[{"id":1,"animalName":"Lupo appennico","animalCode":"LUPO"},{"id":2,"animalName":"Orso bruno marsicano","animalCode":"ORSO"},{"id":3,"animalName":"Volpe","animalCode":"VOLPE"}]
----+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
2 |[{"id":1,"animalName":"Cinghiale","animalCode":"CINGHIALE"},{"id":2,"animalName":"Orso bruno marsicano","animalCode":"ORSO"},{"id":3,"animalName":"Cervo","animalCode":"CERVO"}]|
I would like to extract only values after '"animalName":' and put them in a new field.
id | new_field |
----+--------------------------------------------+
1 |Lupo appennico, Orso bruno marsicano,Volpe |
----+--------------------------------------------+
2 |Cinghiale, Orso bruno marsicano, Cervo |
Unfortunately the field is a text type (not json or array). I've tried with regexp without success.
Your column is not of a json datatype, but it seems to contain valid json. If so, you can cast it and use json functions on it:
select id, string_agg(j ->> 'animalName', ', ') new_field
from mytable t
cross join lateral jsonb_array_elements(t.species::jsonb) j(obj)
group by id
order by id
Demo on DB Fiddle:
id | new_field
-: | :------------------------------------------
1 | Lupo appennico, Orso bruno marsicano, Volpe
2 | Cinghiale, Orso bruno marsicano, Cervo

Google BigQuery - Parsing string data from a Bigquery table column

I have a table A within a dataset in Bigquery. This table has multiple columns and one of the columns called hits_eventInfo_eventLabel has values like below:
{ID:AEEMEO,Score:8.990000;ID:SEAMCV,Score:8.990000;ID:HBLION;Property
ID:DNSEAWH,Score:0.391670;ID:CP1853;ID:HI2367;ID:H25600;}
If you write this string out in a tabular form, it contains the following data:
**ID | Score**
AEEMEO | 8.990000
SEAMCV | 8.990000
HBLION | -
DNSEAWH | 0.391670
CP1853 | -
HI2367 | -
H25600 | -
Some IDs have scores, some don't. I have multiple records with similar strings populated under the column hits_eventInfo_eventLabel within the table.
My question is how can I parse this string successfully WITHIN BIGQUERY so that I can get a list of property ids and their respective recommendation scores (if existing)? I would like to have the order in which the IDs appear in the string to be preserved after parsing this data.
Would really appreciate any info on this. Thanks in advance!
I would use combination of SPLIT to separate into different rows and REGEXP_EXTRACT to separate into different columns, i.e.
select
regexp_extract(x, r'ID:([^,]*)') as id,
regexp_extract(x, r'Score:([\d\.]*)') score from (
select split(x, ';') x from (
select 'ID:AEEMEO,Score:8.990000;ID:SEAMCV,Score:8.990000;ID:HBLION;Property ID:DNSEAWH,Score:0.391670;ID:CP1853;ID:HI2367;ID:H25600;' as x))
It produces the following result:
Row id score
1 AEEMEO 8.990000
2 SEAMCV 8.990000
3 HBLION null
4 DNSEAWH 0.391670
5 CP1853 null
6 HI2367 null
7 H25600 null
You can write your own JavaScript functions in BigQuery to get exactly what you want now: http://googledevelopers.blogspot.com/2015/08/breaking-sql-barrier-google-bigquery.html