How to use regexp_replace in Hive to remove strings

I have a table as:
column1 -> 101#1,102#2,103#3,104#4
I am trying to remove strings (101#,102#,103#,104#). The expected output is
column2 -> 1,2,3,4
I am trying to do this using regexp_replace.
Any help would be highly appreciated.

It seems silly, but you have to break the string into an array, then transform each element (run a function on it), and finally concat the array back into a string:
select concat_ws(',', transform(split('101#1,102#2,103#3,104#4', ','), x -> regexp_replace(x, '.*#', '')))
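Applied to an actual table rather than a literal, it might look like the sketch below; the table name my_table is assumed here, and the transform() lambda syntax requires a recent Hive or Spark SQL version:
-- assumed table/column names; column1 holds values like '101#1,102#2,103#3,104#4'
select concat_ws(',', transform(split(column1, ','), x -> regexp_replace(x, '.*#', ''))) as column2
from my_table;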

Related

Get an element in Json using PostgreSQL

I have this JSON and I want to get this part: '1000000007296871' in SQL
{"pixel_rule":"{\"and\":[{\"event\":{\"eq\":\"Purchase\"}},{\"or\":[{\"content_ids\":{\"i_contains\":\"1000000007296871\"}}]}]}"}
How do I do that?
This is a JSON dump.
Well, you can do this like so:
SELECT
  (
    (
      (
        (
          (
            '{"pixel_rule":"{\"and\":[{\"event\":{\"eq\":\"Purchase\"}},{\"or\":[{\"content_ids\":{\"i_contains\":\"1000000007296871\"}}]}]}"}'::json->>'pixel_rule'
          )::json->>'and'
        )::json->1
      )::json->>'or'
    )::json->0->>'content_ids'
  )::json->>'i_contains';
But something is really funny about your input, since it contains JSON nested in JSON multiple times.
One option would be flattening the object: trim the wrapper quotes from the pixel_rule value, get rid of the redundant backslashes, and then apply the json_array_elements function consecutively so that the #>> operator can be used with the related path to extract the desired value:
SELECT json_array_elements(
         json_array_elements(
           replace(trim((jsdata #> '{pixel_rule}')::text, '"'), '\', '')::json -> 'and'
         ) -> 'or'
       )::json #>> '{content_ids,i_contains}' AS "Value"
FROM tab
My option is using a regex. '\d+' works if the other parts are not digits:
select substring('string' from '\d+')
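As a rough sketch of that idea against the sample value (reusing the jsdata column and tab table names assumed in the previous answer), the id happens to be the only run of digits in the text:
-- works only while the id is the single digit sequence in the value
select substring(jsdata::text from '\d+') as "Value"
from tab;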

Databricks - String manipulation via sql command

I have a table column in Databricks from which I need to get whatever appears between the 15th and 16th appearance of the character #, as in the following example:
Column
1234##E#A#1234#01/01/4500#X#*ABCDE#7#1##N#N#N#0#Z.POIUS.LKJS_20200103#0#
Results
Z.POIUS.LKJS_20200103
How can I do this?
-- keep everything before the 16th '#', then reverse and take the part before the first '#'
select reverse(substring_index(reverse(substring_index('1234##E#A#1234#01/01/4500#X#*ABCDE#7#1##N#N#N#0#Z.POIUS.LKJS_20200103#0#', '#', 16)), '#', 1))
You can just split the string and take element 15 of the resulting array (it's 0-indexed, so that's the piece after the 15th #), e.g. something like this:
%sql
SELECT *,
regexp_extract( yourCol, '(?:[^#]*(#)){15}(.[^#]+)', 2 ) xregex,
split( yourCol, '#' )[15] AS xsplit
FROM tmp
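As a quick sanity check against the literal from the question (no table needed), both expressions should return Z.POIUS.LKJS_20200103:
%sql
SELECT
  regexp_extract('1234##E#A#1234#01/01/4500#X#*ABCDE#7#1##N#N#N#0#Z.POIUS.LKJS_20200103#0#', '(?:[^#]*(#)){15}(.[^#]+)', 2) AS xregex,
  split('1234##E#A#1234#01/01/4500#X#*ABCDE#7#1##N#N#N#0#Z.POIUS.LKJS_20200103#0#', '#')[15] AS xsplit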
I was experimenting with regex, which may be appropriate for some cases too.

How to get substring based on a character and starting to read the string from the right

I have the following values on a column:
DB3-0800-VRET,
DB3-0800-IC,
IB-TZ-850-IB,
O11FS-OB ...
From each value I want to remove the last part after the dash.
I need to have the following result:
DB3-0800-VRET -> DB3-0800,
DB3-0800-IC -> DB3-0800,
O11FS-OB -> O11FS
I tried to work with the SPLIT_PART function of Redshift but I didn't have any luck.
If someone knows a regex to select the part I need I'd be grateful.
In both Postgres and Redshift, you should be able to use regexp_replace():
select regexp_replace(str, '-[^-]+$', '')
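For example, checked against the sample values from the question:
select regexp_replace('DB3-0800-VRET', '-[^-]+$', '');  -- DB3-0800
select regexp_replace('O11FS-OB', '-[^-]+$', '');       -- O11FS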

Postgresql: Extracting substring after first instance of delimiter

I'm trying to extract everything after the first instance of a delimiter.
For example:
01443-30413 -> 30413
1221-935-5801 -> 935-5801
I have tried the following queries:
select regexp_replace(car_id, E'-.*', '') from schema.table_name;
select reverse(split_part(reverse(car_id), '-', 1)) from schema.table_name;
However both of them return:
01443-30413 -> 30413
1221-935-5801 -> 5801
So it's not working if the delimiter appears multiple times.
I'm using Postgresql 11. I come from a MySQL background where you can do:
select SUBSTRING(car_id FROM (LOCATE('-',car_id)+1)) from table_name
Why not just do the PG equivalent of your MySQL approach and substring it?
SELECT SUBSTRING('abcdef-ghi' FROM POSITION('-' in 'abcdef-ghi') + 1)
If you don't like the "from" and "in" way of writing arguments, PG also has "normal" comma separated functions:
SELECT SUBSTR('abcdef-ghi', STRPOS('abcdef-ghi', '-') + 1)
I think that regexp_replace is appropriate, but using the correct pattern:
select regexp_replace('1221-935-5801', E'^[^-]+-', '');
935-5801
The regex pattern ^[^-]+- matches, from the start of the string, one or more non-dash characters, ending with a dash. It then replaces the match with an empty string, effectively removing this content.
Note that this approach also works if the input has no dashes at all, in which case it would just return the original input.
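A quick illustration of that edge case (the literal below is just a made-up value with no dash):
select regexp_replace('0144330413', E'^[^-]+-', '');  -- returns 0144330413 unchanged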
Use this regexp pattern:
select regexp_replace('1221-935-5801', E'^[^-]+-', '') from schema.table_name
Regexp explanation:
^ is the beginning of the string
[^-]+ means at least one character different from -
...until the - character is met
I tried it in the conventional way we generally do it (I found strpos in PostgreSQL, which is similar to instr). You can try the below:
SELECT SUBSTR(car_id, strpos(car_id, '-') + 1, length(car_id))
FROM table;

Text to List in SQL

Is there any way to convert a comma-separated text value to a list so that I can use it with 'IN' in SQL? I am using PostgreSQL for this one.
Ex.:
select location from tbl where
location in (replace(replace(replace('[Location].[SG],[Location].[PH]', ',[Location].[', ','''), '[Location].[', ''''), ']',''''))
This query:
select (replace(replace(replace('[Location].[SG],[Location].[PH]', ',[Location].[', ','''), '[Location].[', ''''), ']',''''))
produces 'SG','PH'
I wanted to produce this query:
select location from tbl where location in ('SG','PH')
Nothing was returned when I executed the first query. The table has been filled with location values 'SG' and 'PH'.
Can anyone help me on how to make this work without using PL/pgSQL?
So you're faced with a friendly and easy-to-use tool that won't let you get any work done; I feel your pain.
A slight modification of what you have combined with string_to_array should be able to get the job done.
First we'll replace your nested replace calls with slightly nicer replace calls:
=> select replace(replace(replace('[Location].[SG],[Location].[PH]', '[Location].', ''), '[', ''), ']', '');
replace
---------
SG,PH
So we strip out the [Location]. noise and then strip out the leftover brackets to get a comma-delimited list of the two-character location codes you're after. There are other ways to get the SG,PH using PostgreSQL's other string and regex functions, but replace(replace(replace(... will do fine for strings with your specific structure.
Then we can split that CSV into an array using string_to_array:
=> select string_to_array(replace(replace(replace('[Location].[SG],[Location].[PH]', '[Location].', ''), '[', ''), ']', ''), ',');
string_to_array
-----------------
{SG,PH}
to give us an array of location codes. Now that we have an array, we can use = ANY instead of IN to look inside an array:
=> select 'SG' = any (string_to_array(replace(replace(replace('[Location].[SG],[Location].[PH]', '[Location].', ''), '[', ''), ']', ''), ','));
?column?
----------
t
That t is a boolean TRUE BTW; if you said 'XX' = any (...) you'd get an f (i.e. FALSE) instead.
Putting all that together gives you a final query structured like this:
select location
from tbl
where location = any (string_to_array(...))
You can fill in the ... with the nested replace nastiness on your own.
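For completeness, one way it might look filled in, reusing the literal from the question and the nested replace calls shown above (nothing new beyond those pieces):
select location
from tbl
where location = any (string_to_array(replace(replace(replace('[Location].[SG],[Location].[PH]', '[Location].', ''), '[', ''), ']', ''), ','));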
Assuming we are dealing with a comma-separated list of elements in the form [Location].[XX],
I would expect this construct to perform best:
SELECT location
FROM tbl
JOIN (
SELECT substring(unnest(string_to_array('[Location].[SG],[Location].[PH]'::text, ',')), 13, 2) AS location
) t USING (location);
Step-by-step
Transform the comma-separated list into an array and split it to a table with unnest(string_to_array()).
You could do the same with regexp_split_to_table(); slightly shorter but more expensive (see the sketch after this list).
Extract the XX part with substring(). Very simple and fast.
JOIN to tbl instead of the IN expression. That's faster - and equivalent while there are no duplicates on either side.
I assign the same column alias location to enable an equijoin with USING.
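A rough sketch of the regexp_split_to_table() variant mentioned above, with the same assumed literal and table:
SELECT location
FROM   tbl
JOIN  (
   SELECT substring(regexp_split_to_table('[Location].[SG],[Location].[PH]'::text, ','), 13, 2) AS location
   ) t USING (location);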
Directly using location in ('something') works.
I have created a fiddle that uses the IN clause on a VARCHAR column:
http://sqlfiddle.com/#!12/cdf915/1