How can I use the regexp_replace function in Hive to strip the markup from this string:
Abc abc ","abc abc abc .
I want to get: Abc abc abc abc abc
Does anyone know?
Assuming column WTF contains
Abc abc ","abc abc abc .
then regexp_replace(regexp_replace(WTF,'<[^>]*>',''), '[",.]','') first removes all XML-style tags, then the punctuation characters (double quote, comma, period), to return
Abc abc abc abc abc
That's plain old regular expression syntax, nothing specific to Hive.
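Because the two patterns are ordinary regular expressions, the same two-step replacement can be tried outside Hive. The following is a minimal sketch in Python, not Hive code; the function name strip_markup and the tagged sample input are invented for illustration, since the markup in the question's string is not shown.

```python
import re

def strip_markup(text):
    """Mirror the nested regexp_replace calls: tags first, then punctuation."""
    # Step 1: remove anything that looks like an XML/HTML tag.
    no_tags = re.sub(r"<[^>]*>", "", text)
    # Step 2: remove double quotes, commas and periods.
    return re.sub(r'[",.]', "", no_tags)

# Hypothetical input: the tags here are an assumption, not the asker's data.
sample = '<p>Abc abc</p> ","<b>abc abc abc</b> .'
cleaned = strip_markup(sample)
print(repr(cleaned))
```

Note that the space before the final period survives, so the result ends with a trailing space; wrapping the expression in a trim (trim() in Hive, .strip() here) removes it.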
Let's say my data is like:
abcd abcd aaa 1234 1234566789 abcd abcd aaa 123456789 1234sfjsalfj
what I want to do is:
If a number has 3 to 6 digits and is preceded by aaa, I keep it.
I want to drop every other number: one with no aaa before it, or one whose digit count is outside my range (3-6).
So this example should be transformed to:
abcd abcd aaa 1234 abcd abcd sfjsalfj
How can I do this in Athena SQL? It does not have to be a single query; using WITH or any other combination of queries is also fine.
You should be able to use this regex to do the replacement in one query:
(?<!aaa |\d)\d+\s*|(aaa (\d{1,2}(?!\d)|\d{7,})\s*)
This matches any run of digits not preceded by "aaa ", or a run of 1, 2, or more than 6 digits that is preceded by "aaa " (in which case the "aaa " is removed as well). Any matches should be replaced by the empty string, using the two-parameter version of regexp_replace, i.e.
SELECT regexp_replace('abcd abcd aaa 1234 1234566789 abcd abcd aaa 123456789 1234sfjsalfj', '(?<!aaa |\d)\d+\s*|(aaa (\d{1,2}(?!\d)|\d{7,})\s*)')
Regex demo on regex101
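To check the pattern against the sample string without running Athena, here is a small Python sketch. It is an adaptation rather than the exact expression above: Python's re module only accepts fixed-width lookbehinds, so (?<!aaa |\d) is split into two consecutive lookbehinds, which is logically equivalent. The function name drop_unwanted_numbers is made up for this example.

```python
import re

# Adapted pattern: (?<!aaa |\d) becomes (?<!aaa )(?<!\d) because Python
# requires every lookbehind alternative to have the same width.
PATTERN = re.compile(r"(?<!aaa )(?<!\d)\d+\s*|aaa (?:\d{1,2}(?!\d)|\d{7,})\s*")

def drop_unwanted_numbers(text):
    """Remove numbers unless they have 3-6 digits and follow 'aaa '."""
    return PATTERN.sub("", text)

sample = "abcd abcd aaa 1234 1234566789 abcd abcd aaa 123456789 1234sfjsalfj"
print(drop_unwanted_numbers(sample))
```

On the sample string this yields the output requested in the question.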
The data I have looks like this:
Column_A Column_B
Hello, how are you ABC
Good day DEF
LMN Nice day
PQR Hello
Hi TYU
GHI Hi
Good night STR
RST Night
If the word Hello or Hi is found in Column_A, Column_B, or both, I want to create a new column Type and assign it type_X.
If the word day or night is found in Column_A, Column_B, or both, I want to assign it type_Y.
The result should look like this:
Column_A Column_B Type
Hello, how are you ABC type_X
Good day DEF type_Y
LMN Nice day type_Y
PQR Hello type_X
Hi TYU type_X
GHI Hi type_X
Good night STR type_Y
RST Night type_Y
How to achieve this result?
Below is for BigQuery Standard SQL
#standardSQL
SELECT *,
CASE
WHEN REGEXP_CONTAINS(LOWER(CONCAT(Column_A, ' ', Column_B)), r'hello|hi') THEN 'type_X'
WHEN REGEXP_CONTAINS(LOWER(CONCAT(Column_A, ' ', Column_B)), r'day|night') THEN 'type_Y'
ELSE 'unknown'
END AS Type
FROM `project.dataset.table`
Applied to the sample data from your question, the output is
Row Column_A Column_B Type
1 Hello, how are you ABC type_X
2 Good day DEF type_Y
3 LMN Nice day type_Y
4 PQR Hello type_X
5 Hi TYU type_X
6 GHI Hi type_X
7 Good night STR type_Y
8 RST Night type_Y
A slightly less verbose version is
#standardSQL
SELECT *,
CASE
WHEN REGEXP_CONTAINS(Column_A_or_B_or_Both, r'hello|hi') THEN 'type_X'
WHEN REGEXP_CONTAINS(Column_A_or_B_or_Both, r'day|night') THEN 'type_Y'
ELSE 'unknown'
END AS Type
FROM `project.dataset.table`,
UNNEST([LOWER(CONCAT(Column_A, ' ', Column_B))]) Column_A_or_B_or_Both
obviously with the same output
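One caveat with a plain substring pattern such as hello|hi is that it also matches inside longer words (for example the "hi" in "GHI" or "this"). The sketch below shows the same first-match-wins classification in Python with word boundaries added; the word boundaries are a variation on the queries above, not part of them, and the name classify is invented for illustration.

```python
import re

# \b restricts matches to whole words; re.IGNORECASE replaces the LOWER() call.
TYPE_X = re.compile(r"\b(?:hello|hi)\b", re.IGNORECASE)
TYPE_Y = re.compile(r"\b(?:day|night)\b", re.IGNORECASE)

def classify(column_a, column_b):
    """Return type_X, type_Y or unknown, checking both columns together."""
    text = f"{column_a} {column_b}"
    if TYPE_X.search(text):  # checked first, like the first WHEN branch
        return "type_X"
    if TYPE_Y.search(text):
        return "type_Y"
    return "unknown"

rows = [("Hello, how are you", "ABC"), ("Good day", "DEF"), ("GHI", "Hi")]
for a, b in rows:
    print(a, b, classify(a, b))
```

In BigQuery the equivalent change would be to write the pattern as r'\b(hello|hi)\b', assuming whole-word matching is what is wanted.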
I have a column in which each cell contains data in this format:
ABC | DEF | GHI | |
ABC | DEF | GHI | JKL |
ABC | DEF | | |
I need to extract the first and last valid (i.e. not empty) sub-strings.
I can extract the first/last substring easily enough using a formula (though it's clunky):
FIRST SUBSTRING
=TRIM(MID(SUBSTITUTE(A1,"|",REPT(" ",LEN(A1))),(4-4)*LEN(A1)+1,LEN(A1)))
LAST SUBSTRING
=TRIM(MID(SUBSTITUTE(A1,"|",REPT(" ",LEN(A1))),(4-1)*LEN(A1)+1,LEN(A1)))
This basically uses SUBSTITUTE to replace the "|" delimiter with spaces, then uses MID to extract the nth substring, followed by TRIM to remove the extra spaces. But if the last delimited substring is empty, it returns an empty string (as it is meant to, I guess).
How can I modify this formula to extract the last valid substring (i.e. not empty " ")? Could someone please show me how to do this using VBA code?
ABC | DEF | GHI | |
Output column 1: ABC
Output column 2: GHI
ABC | DEF | GHI | JKL |
Output column 1: ABC
Output column 2: JKL
ABC | DEF | | |
Output column 1: ABC
Output column 2: DEF
Let's say your worksheet is WS
and your values run from cell A2 to A120:
Dim zeValue As String, out1 As String, out2 As String
Dim i As Long
For i = 2 To 120
    ' Strip the spaces and the "|" delimiters, leaving only the substrings
    zeValue = Replace(Replace(WS.Range("A" & i).Value, " ", ""), "|", "")
    ' Assumes every substring is exactly 3 characters long, as in the sample data
    out1 = Left$(zeValue, 3)
    out2 = Right$(zeValue, 3)
    Debug.Print "out1 : " & out1
    Debug.Print "out2 : " & out2
Next i
Not tested, but that should work.
Good luck!
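The Left/Right approach relies on every substring being three characters long. A more general way to state the logic is: split on the delimiter, trim each piece, discard the empty ones, then take the first and last of what remains. Here is a minimal sketch of that logic in Python (not VBA); the function name first_and_last is hypothetical. In VBA the same idea could be written with Split and a loop over the resulting array.

```python
def first_and_last(cell):
    """Return the first and last non-empty '|'-delimited substrings."""
    # Split on the delimiter and trim surrounding whitespace from each piece.
    parts = [piece.strip() for piece in cell.split("|")]
    # Keep only the pieces that still contain text.
    non_empty = [piece for piece in parts if piece]
    if not non_empty:
        return "", ""  # nothing valid in this cell
    return non_empty[0], non_empty[-1]

for cell in ["ABC | DEF | GHI | |", "ABC | DEF | GHI | JKL |", "ABC | DEF | | |"]:
    print(first_and_last(cell))
```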
I am working on a database and have to import data from an Excel sheet. The problem is that the data is laid out horizontally, something like this:
ID Note1 Note2 Note3
2001 ABC DEF GHI
2002 XYZ NULL NULL
2003 MNO PQR NULL
And I want to add it into my table as
ID Notes
2001 ABC
2001 DEF
2001 GHI
2002 XYZ
2003 MNO
2003 PQR
Is there any way I can insert this horizontal data from Excel into my table in SQL vertically?
Once the data is loaded into SQL, you can use UNPIVOT to turn the Note columns into rows.
Pretty easy. Arrange into two columns and use import feature...
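As a rough illustration of the wide-to-long reshaping described above, here is a sketch using Python's built-in sqlite3 module. SQLite has no UNPIVOT operator, so this swaps in a UNION ALL query that produces the same shape; the table name wide, the helper column pos, and the function name unpivot_notes are all assumptions made for the example.

```python
import sqlite3

def unpivot_notes(rows):
    """Turn (ID, Note1, Note2, Note3) rows into (ID, Notes) rows, skipping NULLs."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE wide (ID INTEGER, Note1 TEXT, Note2 TEXT, Note3 TEXT)")
    conn.executemany("INSERT INTO wide VALUES (?, ?, ?, ?)", rows)
    # One SELECT per note column, stacked with UNION ALL; pos keeps column order.
    sql = """
        SELECT ID, Notes FROM (
            SELECT ID, 1 AS pos, Note1 AS Notes FROM wide
            UNION ALL SELECT ID, 2, Note2 FROM wide
            UNION ALL SELECT ID, 3, Note3 FROM wide
        )
        WHERE Notes IS NOT NULL
        ORDER BY ID, pos
    """
    result = conn.execute(sql).fetchall()
    conn.close()
    return result

data = [(2001, "ABC", "DEF", "GHI"), (2002, "XYZ", None, None), (2003, "MNO", "PQR", None)]
for row in unpivot_notes(data):
    print(row)
```

On a database that supports UNPIVOT, that operator replaces the three stacked SELECT statements.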
Is it possible? I'm trying to create an RDL report and want it to look like this.
---------------------------------------
Ship # Loc Ship
---------------------------------------
ABC
630-4144 0 ABC
630-4144 0 ABC