Extract Values based on Keys in a BigQuery column - sql

I have data in the form of key-value pairs (not JSON), as shown below:
id | Attributes
---|---------------------------------------------------
12 | Country:US, Eligibility:Yes, startDate:2022-08-04
33 | Country:CA, Eligibility:Yes, startDate:2021-12-01
11 | Country:IN, Eligibility:No, startDate:2019-11-07
I would like to extract only startDate from the Attributes column.
Expected Output:
id | Attributes_startDate
---|----------------------
12 | 2022-08-04
33 | 2021-12-01
11 | 2019-11-07
One way I tried was converting the Attributes column into JSON by appending { and } at the start and end respectively, and somehow adding double quotes around the keys, then extracting startDate. But is there any other effective solution to extract startDate? I don't want to rely on regex.

Is there any way to just specify the key and extract its respective value, just like the way we can extract values by key from a JSON column using JSON_QUERY?
you can try below
create temp function fakejson_extract(json string, attribute string) as ((
  select trim(split(kv, ':')[safe_offset(1)])
  from unnest(split(json)) kv
  where trim(split(kv, ':')[offset(0)]) = attribute
));
select id,
  fakejson_extract(Attributes, 'Country') as Country,
  fakejson_extract(Attributes, 'Eligibility') as Eligibility,
  fakejson_extract(Attributes, 'startDate') as startDate
from your_table
if applied to sample data in your question - output is
id | Country | Eligibility | startDate
---|---------|-------------|-----------
12 | US      | Yes         | 2022-08-04
33 | CA      | Yes         | 2021-12-01
11 | IN      | No          | 2019-11-07

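Outside BigQuery, the UDF's logic — split on commas, take the part before the first colon as the key, trim it, and compare — can be sanity-checked with a small Python sketch (fakejson_extract here is a plain function mirroring the SQL UDF, not the UDF itself):

```python
def fakejson_extract(attrs, key):
    # Mirror the SQL UDF: split the comma-separated pairs,
    # split each pair on the first colon, compare the trimmed key.
    for kv in attrs.split(","):
        k, _, v = kv.partition(":")
        if k.strip() == key:
            return v.strip()
    return None  # key not present

row = "Country:US, Eligibility:Yes, startDate:2022-08-04"
print(fakejson_extract(row, "startDate"))  # 2022-08-04
print(fakejson_extract(row, "Country"))    # US
```

Note that, like the UDF, this splits on the first colon only, so it would survive a value that itself contains a colon.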
is there any other effective solution to extract startDate as I don't want to rely on Regex.
If your feelings about regex here really are that strong - use below
select id, split(Attribute, ':')[safe_offset(1)] Attributes_startDate
from your_table, unnest(split(Attributes)) Attribute
where trim(split(Attribute, ':')[offset(0)]) = 'startDate'

Use below (I think using RegEx here is the most efficient option)
select id, regexp_extract(Attributes, r'startDate:(\d{4}-\d{2}-\d{2})') Attributes_startDate
from your_table
if applied to sample data in your question - output is
id | Attributes_startDate
---|----------------------
12 | 2022-08-04
33 | 2021-12-01
11 | 2019-11-07
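The regular expression transfers unchanged to any PCRE-style engine; a quick Python check of the same pattern against one sample row:

```python
import re

# Same pattern as the REGEXP_EXTRACT call above
attrs = "Country:US, Eligibility:Yes, startDate:2022-08-04"
m = re.search(r"startDate:(\d{4}-\d{2}-\d{2})", attrs)
print(m.group(1))  # 2022-08-04
```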

Related

Extract JSON in column

I have a table
id | status | outgoing
-------------------------
1 | paid | {"a945248027_14454878":"processing","old.a945248027_14454878":"cancelled"}
2 | pending| {"069e5248cf_45299995":"processing"}
I am trying to extract the values after each underscore in the outgoing column, e.g. from a945248027_14454878 I want 14454878.
Because the JSON data is not standardised, I can't seem to figure it out.
You may extract the JSON key part after the underscore using the regexp version of substring.
select id, status, outgoing,
substring(key from '_([^_]+)$') as key
from the_table, lateral jsonb_object_keys(outgoing) as j(key);
See demo.
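Because the pattern '_([^_]+)$' is anchored at the end of the key, it always captures the piece after the last underscore, so the 'old.' prefix doesn't matter. The same regex can be sanity-checked in Python (a sketch, not Postgres code):

```python
import re

# Keys as jsonb_object_keys() would yield them for the sample rows
keys = ["a945248027_14454878", "old.a945248027_14454878", "069e5248cf_45299995"]

# Same pattern as substring(key from '_([^_]+)$'):
# anchored at the end, it captures everything after the LAST underscore.
suffixes = [re.search(r"_([^_]+)$", k).group(1) for k in keys]
print(suffixes)  # ['14454878', '14454878', '45299995']
```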

SQL - regexp_extract JSON string

I'm trying to extract the locationId from the following string:
{"type":"player","topic_id":"555","topic_name":"sfd","userId":116,"userLocation":{"countryCode":"BR","locationId":21,"locationCity":"Rio de Janeiro"}}
I'm able to extract, for example, the topic_id using the following: safe_cast(regexp_extract(h.events.label, r'"topic_id":"([a-zA-Z0-9-_. ]+)"') as int64)
but this doesn't work for locationId. I'm guessing it's because of the nested dict? But not sure how to get around that.
You'd be better off using a JSON function rather than a regexp function.
WITH sample_data AS (
SELECT '{"type":"player","topic_id":"555","topic_name":"sfd","userId":116,"userLocation":{"countryCode":"BR","locationId":21,"locationCity":"Rio de Janeiro"}}' json
)
SELECT CAST(JSON_VALUE(json, '$.userLocation.locationId') AS INT64) AS locationId
FROM sample_data;
+------------+
| locationId |
+------------+
| 21 |
+------------+
this doesn't work for locationId. I'm guessing it's because of the nested dict?
I guess it's because the value of topic_id is a string ("555") while the value of locationId is an integer (21).
r'"locationId":([a-zA-Z0-9-_. ]+)' will work for locationId, but a simpler regular expression would be r'"locationId":(\d+)'
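For comparison, both routes can be checked in Python (a hedged sketch; json.loads plays the role of JSON_VALUE's path navigation, and the digit-only regex is the one suggested above):

```python
import json
import re

s = ('{"type":"player","topic_id":"555","topic_name":"sfd","userId":116,'
     '"userLocation":{"countryCode":"BR","locationId":21,'
     '"locationCity":"Rio de Janeiro"}}')

# JSON route: navigate the nested object, like '$.userLocation.locationId'
loc_json = json.loads(s)["userLocation"]["locationId"]

# Regex route: the value is an unquoted integer, so match digits only
loc_re = int(re.search(r'"locationId":(\d+)', s).group(1))

print(loc_json, loc_re)  # 21 21
```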

How to parse retrieve value from a json string field in Redshift/SQL

I have a row that looks like this:
id | json_list | expected_result
"1" | [{"id":"1", "text":"text1"},{"id":"3", "text":"text3"}] | "text1"
"2" | [{"id":"2", "text":"text2"},{"id":"3", "text":"text3"}] | "text2"
I want to retrieve the "text" field based on the id column. How can I achieve that in AWS Redshift? I know Redshift has some json functions and it needs to be paired with some kind of loop condition, but I wasn't sure if it's possible in SQL
Please let me know if I have understood your question correctly, because the data is a bit confusing:
Id column value - 2
Your JSON value - [{"id":"2", "text":"text2"},{"id":"3", "text":"text3"}]
Expected result - text2
Solution ->
Step 1 - Iterating through the JSON array elements:
select json_extract_array_element_text('[{"id":"1", "text":"text1"},{"id":"3", "text":"text3"}]',1)
-- returns {"id":"3", "text":"text3"}
Step 2 - Parsing key/value pairs in a particular JSON element:
select json_extract_path_text(json_extract_array_element_text('[{"id":"1", "text":"text1"},{"id":"3", "text":"text3"}]',1),'text')
-- returns text3
So, Suppose I have a table -
create table dev.gp_test1_20200731
(
id int,
json_list varchar(1000),
expected_result varchar(100)
)
Inserting Data -
insert into dev.gp_test1_20200731
values
(1,'[{"id":"1", "text":"text1"},{"id":"3", "text":"text3"}]', 'text1'),
(2,'[{"id":"2", "text":"text2"},{"id":"3", "text":"text3"}]', 'text2')
The data looks like this:
id | json_list                                                | expected_result
---|----------------------------------------------------------|----------------
1  | [{"id":"1", "text":"text1"},{"id":"3", "text":"text3"}]  | text1
2  | [{"id":"2", "text":"text2"},{"id":"3", "text":"text3"}]  | text2
This is how the query would be (in the sample data the matching element sits at index 0):
select json_extract_path_text(json_extract_array_element_text(json_list,0),'text')
from dev.gp_test1_20200731
where id = 2
Result - text2
However, storing JSON in Redshift is not good practice.
Documentation on why - Link
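The two Redshift functions can be mirrored step by step in Python; the hypothetical text_for_id helper below also makes the match against the id column explicit, which a hard-coded array index cannot do:

```python
import json

rows = [
    (1, '[{"id":"1", "text":"text1"},{"id":"3", "text":"text3"}]'),
    (2, '[{"id":"2", "text":"text2"},{"id":"3", "text":"text3"}]'),
]

def text_for_id(json_list, row_id):
    # What json_extract_array_element_text + json_extract_path_text do,
    # but scanning for the element whose "id" matches the row id
    # instead of picking a fixed index.
    for element in json.loads(json_list):
        if element["id"] == str(row_id):
            return element["text"]
    return None

print([text_for_id(j, i) for i, j in rows])  # ['text1', 'text2']
```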

Google BigQuery - Parsing string data from a Bigquery table column

I have a table A within a dataset in Bigquery. This table has multiple columns and one of the columns called hits_eventInfo_eventLabel has values like below:
{ID:AEEMEO,Score:8.990000;ID:SEAMCV,Score:8.990000;ID:HBLION;Property
ID:DNSEAWH,Score:0.391670;ID:CP1853;ID:HI2367;ID:H25600;}
If you write this string out in a tabular form, it contains the following data:
ID      | Score
--------|----------
AEEMEO | 8.990000
SEAMCV | 8.990000
HBLION | -
DNSEAWH | 0.391670
CP1853 | -
HI2367 | -
H25600 | -
Some IDs have scores, some don't. I have multiple records with similar strings populated under the column hits_eventInfo_eventLabel within the table.
My question is how can I parse this string successfully WITHIN BIGQUERY so that I can get a list of property ids and their respective recommendation scores (if existing)? I would like to have the order in which the IDs appear in the string to be preserved after parsing this data.
Would really appreciate any info on this. Thanks in advance!
I would use a combination of SPLIT to separate the string into different rows and REGEXP_EXTRACT to separate it into different columns, i.e.
select
  regexp_extract(x, r'ID:([^,]*)') as id,
  regexp_extract(x, r'Score:([\d\.]*)') score
from (
  select split(x, ';') x
  from (
    select 'ID:AEEMEO,Score:8.990000;ID:SEAMCV,Score:8.990000;ID:HBLION;Property ID:DNSEAWH,Score:0.391670;ID:CP1853;ID:HI2367;ID:H25600;' as x))
It produces the following result:
Row | id      | score
----|---------|----------
1   | AEEMEO  | 8.990000
2   | SEAMCV  | 8.990000
3   | HBLION  | null
4   | DNSEAWH | 0.391670
5   | CP1853  | null
6   | HI2367  | null
7   | H25600  | null
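The split-then-extract logic is easy to check outside BigQuery; here is a small Python sketch applying the same two regexes per semicolon-separated chunk, preserving the original order:

```python
import re

label = ("ID:AEEMEO,Score:8.990000;ID:SEAMCV,Score:8.990000;ID:HBLION;"
         "Property ID:DNSEAWH,Score:0.391670;ID:CP1853;ID:HI2367;ID:H25600;")

rows = []
for chunk in label.split(";"):
    if not chunk:
        continue  # the trailing ';' leaves an empty chunk
    id_m = re.search(r"ID:([^,]*)", chunk)
    score_m = re.search(r"Score:([\d.]*)", chunk)
    # Score is optional: IDs without a score get None, like the nulls above
    rows.append((id_m.group(1), score_m.group(1) if score_m else None))

print(rows)
```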
You can write your own JavaScript functions in BigQuery to get exactly what you want now: http://googledevelopers.blogspot.com/2015/08/breaking-sql-barrier-google-bigquery.html

Custom sorting (order by) in PostgreSQL, independent of locale

Let's say I have a simple table with two columns: id (int) and name (varchar). In this table I store some names which are in Polish, e.g.:
1 | sępoleński
2 | świecki
3 | toruński
4 | Włocławek
Now, let's say I want to sort the results by name:
SELECT * FROM table ORDER BY name;
If I have C locale, I get:
4 | Włocławek
1 | sępoleński
3 | toruński
2 | świecki
which is wrong, because "ś" should be after "s" and before "t". If I use Polish locale (pl_PL.UTF-8), I get:
1 | sępoleński
2 | świecki
3 | toruński
4 | Włocławek
which is also not what I want, because I would like names starting with capital letters to be first just like in C locale, like this:
4 | Włocławek
1 | sępoleński
2 | świecki
3 | toruński
How can I do this?
If you want a custom sort, you must define a function that modifies your values so that the natural ordering of the modified values fits your requirement.
For example, you can prepend some character or string if the value starts with an uppercase letter:
CREATE OR REPLACE FUNCTION mysort(text) RETURNS text IMMUTABLE AS $$
  SELECT CASE
           WHEN substring($1 from 1 for 1) = upper(substring($1 from 1 for 1))
           THEN 'AAAA' || $1
           ELSE $1
         END;
$$ LANGUAGE SQL;
And then
SELECT * FROM table ORDER BY mysort(name);
This is not foolproof (you might want to change 'AAAA' for something more apt) and hurts performance, of course.
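Python's default string comparison is by code point, which behaves much like the C locale here, so the mysort() trick can be sketched and observed in isolation; mysort_key is a hypothetical stand-in for the SQL function:

```python
names = ["sępoleński", "świecki", "toruński", "Włocławek"]

def mysort_key(name):
    # Same trick as mysort(): prefix capitalised names with 'AAAA'
    # so they sort before everything else.
    return ("AAAA" + name) if name[:1].isupper() else name

print(sorted(names, key=mysort_key))
# ['Włocławek', 'sępoleński', 'toruński', 'świecki']
```

The capital now sorts first, but 'świecki' still lands after 'toruński' under code-point comparison; in Postgres the pl_PL collation handles that part, which is why the function alone suffices there.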
If you want it efficient, you'll need to create another column that sorts correctly "naturally" (i.e. even in the C locale), and use that as the sorting criterion. For that, follow the approach of the strxfrm C library function. As a straightforward strxfrm table for your case, replace each letter with two ASCII characters: 's' would become 's0' and 'ś' would become 's1'. Then 'świecki' becomes 's1w0i0e0c0k0i0', and regular ASCII sorting will sort it correctly.
If you don't want to create a separate column, you can try to use a function in the ORDER BY clause:
SELECT * FROM table ORDER BY strxfrm(name);
Here, strxfrm needs to be replaced with a proper function. Either you write one yourself, or you use the standard translate function (although this doesn't support replacing a character with two of them, so you'll need some more involved transformation).
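A hypothetical strxfrm-style table along those lines can be sketched in Python; the PL mapping and the capitals-first prefix are illustrative assumptions, not a complete Polish collation:

```python
# Two-character expansion: each base letter becomes letter+'0',
# each accented Polish letter becomes its base letter + a higher digit.
PL = {"ą": "a1", "ć": "c1", "ę": "e1", "ł": "l1", "ń": "n1",
      "ó": "o1", "ś": "s1", "ź": "z1", "ż": "z2"}

def xfrm(name):
    # '0' prefix puts capitalised names first (the mysort() trick),
    # then each letter is expanded so plain ASCII sorting works.
    prefix = "0" if name[:1].isupper() else "1"
    return prefix + "".join(PL.get(c.lower(), c.lower() + "0") for c in name)

names = ["sępoleński", "świecki", "toruński", "Włocławek"]
print(sorted(names, key=xfrm))
# ['Włocławek', 'sępoleński', 'świecki', 'toruński']
```

This reproduces exactly the ordering asked for: capitals first, and 'ś' between 's' and 't'.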