SQL: Extract from messy JSON nested field with backslashes - sql

I have a table in which some rows hold normal JSON and others hold escaped values (backslashes) in the JSON field:
| id | obj |
|----|-----|
| 1 | {"is_from_shopping_bag":true,"products":[{"price":{"amount":"18.00","currency":"USD","offset":100,"amount_with_offset":"1800"},"product_id":"1234","quantity":1}],"source":"cart"} |
| 2 | {"is_from_shopping_bag":"","products":"[{\ "product_id\ ":\ "2345\ ",\ "price\ ":{\ "currency\ ":\ "USD\ ",\ "amount\ ":\ "140.00\ ",\ "offset\ ":100},\ "quantity\ ":1}]"} |
(Note: I needed to include a space after the backslashes in the above table so that they would show up in the github generated markdown table -- my actual table does not include those spaces between the backslash and the quote character)
I am running a SQL query in Hive to get the 'currency' field.
Currently I can run
SELECT
id,
JSON_EXTRACT(obj, '$.products[0].price.currency')
FROM my_table
This gives me the correct output for the first row, but NULL for the second row:
| id | obj |
|----|-----|
| 1 | "USD" |
| 2 | NULL |
What is the best way to get currency field from the second row? Is there a way to clean up the field and remove the backslashes before trying to JSON_EXTRACT the relevant data?
I could use REPLACE to swap the '\ ' for '', but is that the most efficient method?

Replace \" with " using regexp_replace like this:
regexp_replace(obj,'\\\\"','"')
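One thing to watch: in row 2 the whole `products` array is stored as a quoted string, so removing the backslashes alone may still leave a value that is not valid JSON. Here is a minimal sketch in Python (not Hive) of what the cleanup has to achieve. The sample rows are shortened reconstructions of the ones in the question, and `extract_currency` is a made-up helper name.

```python
import json

def extract_currency(obj: str):
    """Return products[0].price.currency, also for double-encoded rows."""
    doc = json.loads(obj)
    products = doc["products"]
    # In the messy rows, products is a JSON string that itself contains JSON.
    if isinstance(products, str):
        products = json.loads(products)
    return products[0]["price"]["currency"]

# Row 1: products is a real JSON array (shortened sample).
row1 = ('{"products":[{"price":{"amount":"18.00","currency":"USD"},'
        '"product_id":"1234"}],"source":"cart"}')

# Row 2: products is a string with escaped quotes (assumed reconstruction).
inner = ('[{"product_id":"2345","price":{"currency":"USD",'
         '"amount":"140.00","offset":100},"quantity":1}]')
row2 = json.dumps({"is_from_shopping_bag": "", "products": inner})

# String-level cleanup: drop the backslashes (as regexp_replace does),
# then also unwrap the quotes around the array.
cleaned = row2.replace('\\"', '"').replace('"[', '[').replace(']"', ']')

print(extract_currency(row1), extract_currency(row2))
print(json.loads(cleaned)["products"][0]["price"]["currency"])
```

If the same holds for your data, the SQL side would need a second replacement for the `"[` and `]"` wrappers, or a second JSON extraction applied to the extracted `products` string.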

Related

How to filter String in where clause

I would like to extract part of a string using a WHERE clause in SAP HANA. For example, these are 3 strings in the name column:
123._SYS_BIC.meag.app.qthor.cidwh_eingangsschicht.backend.dblayer.l2.checks/MasterData_Holdings.
153._SYS_BIC.meag.app.qthor.centralAdministration.backend.dblayer.l2.checks/AuditAndSecurities.
meag.app.qthor.centralAdministration.backend.dblayer.l2.checks/GeneralLedger
After filtering, the name column should show only the last portion of each string. In other words, whatever the string contains, everything from the beginning up to and including the '/' should be removed. The output would look like this:
"MasterData_Holdings"
"AuditAndSecurities"
"GeneralLedger"
You can try using REPLACE_REGEXPR.
I'm not familiar with HANA myself, but the function is pretty straightforward and it should be:
select REPLACE_REGEXPR('.+/(.+)' IN fieldName WITH '\1' OCCURRENCE ALL) as field
...
where
... -- your filter
Be aware that the regex '.+/(.+)' is greedy and will consume everything up to the last /. For instance, if you have ....checks/MasterData_Holdings/Something, it will return only Something.
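To see the greedy behaviour outside the database, here is a small Python sketch of the same pattern. It only illustrates the regex logic. HANA's regex engine may differ in details, and `last_segment` is a made-up helper name.

```python
import re

def last_segment(name: str) -> str:
    # Greedy '.+/' consumes everything up to the LAST slash;
    # group 1 captures whatever follows it.
    return re.sub(r'.+/(.+)', r'\1', name)

print(last_segment('153._SYS_BIC.meag.app.qthor.checks/AuditAndSecurities'))
print(last_segment('x.checks/MasterData_Holdings/Something'))
print(last_segment('NoSlashHere'))  # no match, so the input comes back unchanged
```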

Oracle SQL - How to Select a substring index inside another Query?

I'm writing a query that returns a bunch of things from multiple tables. The main query is against Table_1. I need to return a substring from a field in table 7. But I'm getting an error that Substring_Index is an invalid identifier. How can I achieve the intended result?
I have a field COLUMN_1 of TABLE_1 that has 3+ pieces of data, separated by " : " (space colon space) and I need to strip out the text before the first delimiter, and return the rest of it (regardless of length).
A simplified example:
SELECT t1.name
,t1.address
,t1.phone
,t2. fave_brand
,SUBSTRING_INDEX(t3.fave_product, ' : ', -1) AS Fave Product
FROM table_1 t1
INNER JOIN table_2 t2
ON t2.brand_SK = t1.fave_brand_FK
INNER JOIN table_3 t3
ON t3.product_list_SK = t1.fave_products
WHERE <a series of constraints>;
Please note, I am NOT normally an SQL developer, but the back-end dev is on vacation and I've been tasked with cobbling this fix together. I'm a beginner at best.
In Oracle you could use regexp_replace():
regexp_replace(t3.fave_product, '^[^:]*:', '') "Fave Product"
regexp_replace() replaces the part of the string that matches the regexp given as second argument with the value given as third argument. Here, we use the empty string as third argument, meaning that the matching part of the string is suppressed.
Regexp breakdown:
^ beginning of the string
[^:]* as many characters as possible other than ":" (possibly, 0 characters)
: character ":"
NB: identifiers that contain special characters (such as space) need to be double quoted.
Oracle does not support substring_index(). That is a MySQL function.
You can use regexp_substr(). Without sample data it is a little hard to be 100% sure, but I think the logic you want is:
regexp_substr(t3.fave_product, '[^:]+$') as fave_product
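The two suggestions do different things when the field has 3 or more pieces, so it is worth checking which one you want. Here is a quick Python sketch of both patterns. The sample value is invented, and Oracle's regex engine may differ in details.

```python
import re

sample = "Acme : Widget : Blue"  # hypothetical value with " : " delimiters

# Same pattern as the regexp_replace() answer: drop everything
# up to and including the FIRST colon.
rest_after_first = re.sub(r'^[^:]*:', '', sample)

# Same pattern as the regexp_substr() answer: keep only what
# follows the LAST colon.
last_piece = re.search(r'[^:]+$', sample).group(0)

print(repr(rest_after_first))  # ' Widget : Blue'
print(repr(last_piece))        # ' Blue'
```

Both results keep the space that followed the colon, so wrapping the expression in TRIM() is probably needed for the " : " delimiter.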

regex_replace to append to end of line?

I have a postgres table which contains rows that each hold multiple lines of text (split by new lines), for example...
The table name is formats, column is called format, an example format (1 table row) would look like the following:
list1=text1;
list2=text2;
list3=text3;
etc etc
I would like a way to identify the list2 string and then append additional text to the end of the same line.
So the outcome would be:
list1=text1;
list2=text2;additionaltext
list3=text3;
I have tried the below to try and pull in the 'capture string' into the replace string but have been unsuccessful so far.
regexp_replace(format, 'list2=.*', '\1 additionaltext','n');
To capture a pattern, you must enclose it in parentheses.
regexp_replace(format, '(list2=.*)', '\1additionaltext', 'n')
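The same capture-and-append idea can be checked in Python. This is only a sketch of the regex logic. Python's `.` stops at a newline by default, which is roughly what the 'n' flag asks Postgres to do.

```python
import re

fmt = "list1=text1;\nlist2=text2;\nlist3=text3;"

# The parentheses create group 1; \1 puts the matched line back
# before the appended text.
result = re.sub(r'(list2=.*)', r'\1additionaltext', fmt)

print(result)
```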

Regular expression to remove element not match specific prefix

I am doing this in Impala or Hive. Basically, let's say I have a string like this:
f-150:aa|f-150:cc|g-210:dd
The elements are separated by the pipe |, and each has a prefix such as f-150. I want to remove the prefix and keep only the elements that match a specific prefix. For example, if the prefix is f-150, I want the final string after regex_replace to be
aa|cc
dd is removed because g-210 is a different prefix and does not match, so the whole element is removed.
Any idea how to do this using string expression in one SQL?
Thanks
UPDATE 1
I tried this in Impala:
select regexp_extract('f-150:aa|f-150:cc|g-210:dd','(?:(?:|(\\|))f-150|keep|those):|(?:^|\\|)\\w-\\d{3}:\\w{2}',0);
But got this output:
f-150:aa
In Hive, I got NULL.
The regex in question could look like this:
(?:(?:|(\\|))f-150|keep|those):|(?:^|\\|)\\w-\\d{3}:\\w{2}
I have added some pseudo keywords to retain, but I am sure you get the idea:
Wholly match elements that should be dropped, but match only the prefix for those that should be retained.
To keep the separator intact, match | at the beginning of an element in group 1 and put it back in the replacement with $1.
According to the documentation, the pattern should be written as a Java regex, so it should behave the same way as it does in Java.
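Note that the pattern is meant for a replace call, with group 1 put back in the replacement, rather than for regexp_extract. Here is a rough Python sketch of that replace. Python writes the backreference as \1 where Java uses $1, and `keep_prefix` is a made-up helper name.

```python
import re

# Unescaped form of the pattern above (no doubled backslashes needed
# inside a Python raw string).
PATTERN = re.compile(r'(?:(?:|(\|))f-150|keep|those):|(?:^|\|)\w-\d{3}:\w{2}')

def keep_prefix(s: str) -> str:
    # Group 1 holds the leading pipe of a retained element (if any);
    # an unmatched group is substituted as an empty string.
    return PATTERN.sub(r'\1', s)

print(keep_prefix('f-150:aa|f-150:cc|g-210:dd'))
```

One edge case to be aware of: if the dropped element comes first (for example `g-210:dd|f-150:aa`), the retained element keeps its leading pipe.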
You could match the values that you want to remove and then replace with an empty string:
f-150:|\|[^:]+:[^|]+$|[^|]+:[^|]+\|
f-150:|\\|[^:]+:[^|]+$|[^|]+:[^|]+\\|
Explanation
f-150: Match literally
| Or
\|[^:]+:[^|]+$ Match a pipe, one or more characters other than a colon, a colon, one or more characters other than a pipe, and assert the end of the line
| Or
[^|]+:[^|]+\| Match one or more characters other than a pipe, a colon, one or more characters other than a pipe, and then a pipe
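As a sanity check of the alternation described above, here is a short Python sketch that replaces every match with an empty string. It only tests the regex logic, not Hive or Impala themselves, and `strip_unwanted` is a made-up helper name.

```python
import re

# Three alternatives: the wanted prefix, an unwanted element at the end
# of the string, or an unwanted element followed by a pipe.
REMOVE = re.compile(r'f-150:|\|[^:]+:[^|]+$|[^|]+:[^|]+\|')

def strip_unwanted(s: str) -> str:
    return REMOVE.sub('', s)

print(strip_unwanted('f-150:aa|f-150:cc|g-210:dd'))
print(strip_unwanted('g-210:dd|f-150:aa'))
```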
You may have to loop through the string to the end to get all the matching substrings. Lookahead syntax is not supported in most SQL dialects, so the regexes above might not be usable in SQL. For your purpose you can create a small table of numbers to loop over, mimicking Oracle's LEVEL syntax, and join it with the table containing the string.
With loop_tab as (
Select 1 loop union all
Select 2 union all
select 3 union all
select 4 union all
select 5),
string_tab as(Select 'f-150:aa|ade|f-150:ce|akg|f-150:bb|'::varchar(40) as str)
Select regexp_substr(str,'(f\\-150\\:\\w+\\|)',1,loop)
from string_tab
join loop_tab on 1=1
Output:
regexp_substr
f-150:aa|
f-150:ce|
f-150:bb|
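The join against the numbers table simply asks for the 1st, 2nd, 3rd, ... occurrence of the pattern. Here is a Python sketch of the same idea. `nth_match` is a made-up helper that stands in for the occurrence argument of regexp_substr.

```python
import re

PATTERN = re.compile(r'f-150:\w+\|')

def nth_match(text: str, n: int):
    """Return the n-th (1-based) match, or None if there are fewer matches."""
    matches = PATTERN.findall(text)
    return matches[n - 1] if 1 <= n <= len(matches) else None

s = 'f-150:aa|ade|f-150:ce|akg|f-150:bb|'

# Mimic the five-row loop table: occurrences 4 and 5 do not exist.
rows = [nth_match(s, n) for n in range(1, 6)]
print(rows)
```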

Regex Postgres More than one dot

I need to return the fields that have more than one . in a specific column.
Now I have this query:
select *
from table
where column ~ '\.{2,}?';
But for some reason it returns nothing. If I use something like 'A{2,}?' it works. Apparently the problem is the dot.
It returns nothing because the dots are not next to each other: '\.{2,}?' only matches two or more consecutive dots. The regex has to account for the characters that appear between the dots. You could try this instead:
select *
from table
where column ~ '\.\d{3}\.';
Or, instead of focusing only on the dot characters, match the string as a whole and take the numbers into account as well:
where column ~ '^\d{3}\.\d{3}\.';
Why not just use like?
where column like '%.%.%'
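The difference between "two dots in a row" and "two dots anywhere" is easy to check in a quick Python sketch. The sample values are invented and the helper names are made up.

```python
import re

def has_adjacent_dots(value: str) -> bool:
    # What '\.{2,}' tests: two or more dots next to each other.
    return re.search(r'\.{2,}', value) is not None

def has_multiple_dots(value: str) -> bool:
    # What LIKE '%.%.%' tests: a dot, anything, then another dot.
    return re.search(r'\..*\.', value) is not None

for value in ['123.456.789', 'a..b', '1.5']:
    print(value, has_adjacent_dots(value), has_multiple_dots(value))
```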