hive or impala function to get substring of a string - hive

My string (it is a Hive query) contains many FROM and JOIN clauses, and I want to use a regex function to get all the substrings that follow these keywords.
Below is the sample string:
str=
'select col1, col2, col3 from dbname.table1,table2
left JOIN table3
on id=id
cross JOIN table4
where filter='check'
AND row<1
AND id=5'
Required output should be:
Ex:
select Regex(str,'from ') => dbname.table1,table2
select Regex(str,'JOIN ') => table3 table4

You can use the following regular expression to capture the table names that follow a FROM or JOIN keyword.
((JOIN|join|From|from)\s)\w+((\.|,)\w+){0,}
Note that I have listed the keywords in both lowercase and uppercase. You can keep only one form if the query string uses consistent casing, or you can do a case-insensitive match instead.
The above regex will give the following result.
Case 1 : From
Full Match: from dbname.table1,table2
Match Group1: from (note the space at the end)
Case 2 : Join
Full Match: JOIN table3 and JOIN table4
Match Group1: JOIN (note the space at the end)
For each match, you can then use the group 1 result to strip the unwanted prefix (from or JOIN, including the trailing space) from the full match, which leaves only the table names.
Use this site to play and learn regex: https://regex101.com/
EDIT 1
In hive
regexp_extract('fooblabar', 'foo(.*?)(bar)', 1)
will give you the first group. In this case, it's bla
EDIT 2
Small update on the regular expression to capture the result in group3
((JOIN|join|From|from)\s)(\w+((\.|,)\w+){0,})
This should do the trick
select split(trim(regexp_replace('select Id from test1 where join test2','((JOIN|join|From|from)\s)(\w+((\.|,)\w+){0,})',' $3')),' ');
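If the extraction can happen outside Hive, the same idea can be sketched in Python. This is only an illustration under assumptions: the helper name tables_after is made up, Python's re module stands in for Hive's regex functions, and a case-insensitive flag replaces the explicit JOIN|join|From|from list.

```python
import re

# Group 1 is the keyword, group 2 is the table list that follows it.
# re.IGNORECASE replaces the explicit JOIN|join|From|from alternatives.
PATTERN = re.compile(r"\b(from|join)\s+(\w+(?:[.,]\w+)*)", re.IGNORECASE)


def tables_after(keyword, query):
    """Return the table lists that follow each occurrence of keyword."""
    return [
        m.group(2)
        for m in PATTERN.finditer(query)
        if m.group(1).lower() == keyword.lower()
    ]


query = """select col1, col2, col3 from dbname.table1,table2
left JOIN table3
on id=id
cross JOIN table4
where filter='check'
AND row<1
AND id=5"""

from_tables = tables_after("from", query)
join_tables = tables_after("join", query)
```

With the sample string from the question, from_tables holds the single entry dbname.table1,table2 and join_tables holds table3 and table4.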

Related

WHERE rows in col1 that are present as a substring of col2 in Snowflake

I have two columns in separate tables structured as follows.
table1.col1: Rows are one word strings.
table2.col2: Rows are many word strings.
I would like to SELECT all the rows of col1 that are present as a substring of any row in col2.
For example I would want to keep the row containing "fox" in table1.col1 if any rows in table2.col2 have the string "the quick brown fox", but remove the row containing "xyz" if it is not present in any rows in table2.col2.
I know you can use LIKE to compare to one string, but not sure how to compare to a column of strings. Thanks!
LIKE can be used directly in the JOIN condition:
SELECT *
-- DISTINCT table1.* -- if only rows from table1 are required
FROM table2 -- many words string
JOIN table1 -- single terms
ON table2.col2 ILIKE CONCAT('%', table1.col1, '%');
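To try the idea without a Snowflake account, here is a rough sketch using Python's built-in sqlite3 module. The assumptions: SQLite stands in for Snowflake, LIKE with || replaces ILIKE with CONCAT (SQLite's LIKE is already case-insensitive for ASCII), and an EXISTS subquery is used so each table1 row comes back at most once. The sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table1 (col1 TEXT);  -- single terms
CREATE TABLE table2 (col2 TEXT);  -- many-word strings
INSERT INTO table1 VALUES ('fox'), ('xyz');
INSERT INTO table2 VALUES ('the quick brown fox'), ('a Fox again');
""")

# Keep a table1 row only if at least one table2 row contains it as a substring.
rows = conn.execute("""
SELECT t1.col1
FROM table1 t1
WHERE EXISTS (
    SELECT 1 FROM table2 t2
    WHERE t2.col2 LIKE '%' || t1.col1 || '%'
)
""").fetchall()
```

Here 'fox' is kept once even though two rows of table2 contain it, and 'xyz' is dropped.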

PostgreSQL: Joining Tables Based on Searched Concatenated Strings

I'm not sure how to write a join clause that takes a value from table 1, then searches a string in table 2 to see if they match. Sound confusing?
Here's the actual example I'm working with.
Table 1
Customer_Id Concat_Phone_Numbers
1 8888888888;1111111111
Table 2
Caller Callee Calldate
1111111111 3333333333 1/1/1900
I want to create a table that looks like this:
Desired Table
Customer_Id Calldate
1 1/1/1900
I'm lost when it comes to writing the join clause so that the entire list in Table 1's second column is searched for a matching phone number/entry.
Thank you in advance for your help! (PS it's my first time asking a question!)
Edit::
Here's where I'm at now
Select
*
from table1
left join table2
on ??????????????????
Yuck! You should fix the data structure. You really need a table with one row per customer and per phone number. You'll understand why if you care about performance.
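But, if you are stuck with this data model, you can do a join using string and/or array operations. Here is a method using regular expressions. The inner parentheses make ^ and $ apply to every alternative, and the outer ones are needed because ~ and || have the same precedence in Postgres:
select . . .
from table1 t1 left join
table2 t2
on t2.caller ~ ('^(' || replace(t1.concat_phone_numbers, ';', '|') || ')$') or
t2.callee ~ ('^(' || replace(t1.concat_phone_numbers, ';', '|') || ')$');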
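One detail worth checking when a delimited list is turned into an anchored alternation is that the alternatives sit inside a group, otherwise ^ only binds to the first number and $ only to the last. A small Python sketch of that point, with a made-up helper name phone_pattern and Python's re module standing in for the Postgres regex engine:

```python
import re


def phone_pattern(concat_numbers):
    """Build an anchored regex matching exactly one number from a ';' list."""
    alternatives = "|".join(re.escape(n) for n in concat_numbers.split(";"))
    # The group makes ^ and $ apply to every alternative, not just the ends.
    return "^(" + alternatives + ")$"


numbers = "8888888888;1111111111"
grouped = phone_pattern(numbers)
ungrouped = "^" + numbers.replace(";", "|") + "$"

exact_hit = re.search(grouped, "1111111111") is not None
grouped_false_hit = re.search(grouped, "91111111111") is not None
ungrouped_false_hit = re.search(ungrouped, "91111111111") is not None
```

The ungrouped pattern accepts 91111111111 because its second alternative is only anchored at the end, while the grouped pattern rejects it.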

CharIndex returning null when I know there is overlap in two strings

I have created a new column in my table(table1) . I am trying to populate it with data from another table, table2.
Table1 has a column called 'Name'. 'Name' contains a substring indicating the language of the column. I wish to compare this substring with the 'Language' column of table2, which contains the substring in the name column and insert the corresponding LanguageID into my new column.
So, for instance :
table1
Name
xxXxxxXxxxxxzxzxzxz xxxazxzxxXXXZxxzxzx 2183909213 ENG-UK nfjksdnfnd 723984782347
and table2 :
table2
Language | ID
ENG-uk | 1
In the table1 name column, the string before and after the Language can take any form, a varying number of characters. The language will always have a space before and after it.
So, I want to end up with :
table1
Name | LanguageID
xx... | 1
I have this query which I believe should work :
INSERT INTO table1 (LanguageID)
SELECT t2.ID FROM table2 t2, table1 t1 WHERE CHARINDEX(LOWER(t2.Language), LOWER(t1.Name)) != null
The problem is, when I run this...."(0 row(s) affected)", which should not be the case.
Does anyone have any ideas ?
The reason that you don't get any matches at all is that you can't use the != operator to compare null values, you have to use is not null for that.
However, that will give you a very big result, as the return value from charindex is never null. When the string isn't found it returns zero, so that is what you should compare against.
Also, you can't insert columns, you have to first add the column to the table, then update the records:
update t1
set LanguageID = t2.ID
from table1 t1
inner join table2 t2 on charindex(lower(t2.Language), lower(t1.Name)) != 0
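Both points (a != comparison against null never being true, and the not-found value being zero) can be reproduced in a quick sketch with Python's sqlite3 module. Assumptions: SQLite stands in for SQL Server, instr(haystack, needle) replaces CHARINDEX (note that the argument order is the other way round), and a correlated subquery replaces the UPDATE ... FROM join. The Name value is shortened from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table1 (Name TEXT, LanguageID INTEGER);
CREATE TABLE table2 (Language TEXT, ID INTEGER);
INSERT INTO table1 (Name) VALUES ('xxXxx 2183909213 ENG-UK nfjksdnfnd 723984782347');
INSERT INTO table2 VALUES ('ENG-uk', 1);
""")

# A comparison with NULL yields NULL, so this predicate never selects a row.
null_compare = conn.execute(
    "SELECT count(*) FROM table2 WHERE 5 != NULL"
).fetchone()[0]

# instr returns 0 (not NULL) when the substring is absent.
not_found = conn.execute("SELECT instr('abc', 'z')").fetchone()[0]

# Fill the new column by comparing the position against 0.
conn.execute("""
UPDATE table1
SET LanguageID = (
    SELECT t2.ID FROM table2 t2
    WHERE instr(lower(table1.Name), lower(t2.Language)) > 0
)
""")
language_id = conn.execute("SELECT LanguageID FROM table1").fetchone()[0]
```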
First, CHARINDEX returns 0 when the search string is not found. It does not return NULL in that case.
See http://msdn.microsoft.com/en-us/library/ms186323.aspx
Second, I think you should UPDATE, not INSERT.
Example (this won't work correctly if more than one t2.Language matches the same Name). Note that CHARINDEX takes the string to find first and the string to search second:
UPDATE table1
SET LanguageID = (SELECT t2.ID FROM table2 t2 WHERE CHARINDEX(LOWER(t2.Language), LOWER(table1.Name)) > 0)
WHERE EXISTS (SELECT t2.ID FROM table2 t2 WHERE CHARINDEX(LOWER(t2.Language), LOWER(table1.Name)) > 0)
You should consider adding individual columns for the pieces of information packed into the first table's Name column. Otherwise, you will end up with performance issues due to the substring operation.
It's clear from the first table that the schema is not normalized. Moreover, a schema like this is not suitable for search or sorting operations.

Pattern matching against two tables

I have two tables with lists of files: table1 has column1 with file names that begin with the string 'STO-' followed by a string pattern. So in total, the string has 16 alphanumeric characters (with dashes it's a 20-character string).
This is similar to the same file name string found in table2, column1. The issue, however, is that in the first table there are also additional text and characters appended to that 20-digit string. I'm attempting to match results from both tables where those 20-digit character strings match, along with additional information from the table. I've found plenty of information about pattern matching within a table, but not comparing two tables. Hopefully I'm explaining myself and can provide an example to help:
TABLE1.Column1 contains a file name 'STO-100-XX-XXXX-XXXX_Text.pdf '
TABLE2.Column2 contains a file name 'STO-100-XX-XXXX-XXXX.pdf' and TABLE2.Column3='Y'
So again, I'm trying to see the list of files from TABLE1 where the first 20-character string has a match in TABLE2.
SELECT * from TABLE1 t1
INNER JOIN TABLE2 t2
ON SUBSTRING(t1.Column1, 1, 20) = SUBSTRING(t2.Column2, 1, 20)
(tested on SQL Server 2005, but I believe SUBSTRING is an ANSI SQL function, so should work on most databases).
Also, it's a little unclear from your question, but if you additionally wanted to restrict the results based on column3, you would simply do
SELECT * from TABLE1 t1
INNER JOIN TABLE2 t2
ON SUBSTRING(t1.Column1, 1, 20) = SUBSTRING(t2.Column2, 1, 20)
WHERE t2.Column3 = 'Y'
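For anyone who wants to try the prefix join without a SQL Server instance, a rough sketch with Python's sqlite3 module follows. Assumptions: SQLite's substr stands in for SUBSTRING, and the file names are invented to follow the 'STO-100-XX-XXXX-XXXX' shape from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE TABLE1 (Column1 TEXT);
CREATE TABLE TABLE2 (Column2 TEXT, Column3 TEXT);
INSERT INTO TABLE1 VALUES
    ('STO-100-AB-CDEF-GHIJ_Text.pdf'),
    ('STO-200-ZZ-0000-0000_Other.pdf');
INSERT INTO TABLE2 VALUES
    ('STO-100-AB-CDEF-GHIJ.pdf', 'Y'),
    ('STO-300-QQ-1111-2222.pdf', 'Y');
""")

# Join on the first 20 characters of each file name.
matches = conn.execute("""
SELECT t1.Column1, t2.Column2
FROM TABLE1 t1
INNER JOIN TABLE2 t2
  ON substr(t1.Column1, 1, 20) = substr(t2.Column2, 1, 20)
WHERE t2.Column3 = 'Y'
""").fetchall()
```

Only the STO-100 file pairs up, since it is the only prefix present in both tables.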

MS SQL 2005 compare field containing square parenthesis

I am using MS SQL Server 2005 (9.0.4035) and trying to find rows that contain the same data in a nvarchar(4000) field. The field contains xml that has both opening and closing square parentheses.
Here is sample data:
DataID Data
1 1
2 1
3 2]
4 2]
5 3[
6 3[
Using the 'like' operator I expected to get 3 matching pairs, but my problem is that row 5 and 6 do not match each other, I only get back that rows 1 & 2 match, and 3 & 4 match.
I know MS SQL 2005 added regular expression support in queries but I did not expect them to evaluate field data as a regular expression, which I think it is doing. Is there a mode that I need to enable to get the proper results?
Any help appreciated,
Ryan
Edit: Added sql statement used:
Select t1.DataID, t2.DataID From TestTable t1, TestTable t2
Where t1.DataID <> t2.DataID
and t1.Data like t2.Data
Edit: Answer
Using '=' operator works, but escaping the '[' does not.
Change your query to use = instead of LIKE and you'll get the results that you expect. SQL 2005 T-SQL won't do regex (you'd need to use CLR functions for that), but the LIKE operator does do pattern matching. '[' and ']' are reserved for pattern matching in a LIKE expression, and you'd have to escape them if you intended them to be matched literally.
See http://msdn.microsoft.com/en-us/library/ms179859.aspx for info on the LIKE statement.
Either of the 2 queries below solved the problem in my tests...
--using equals operator...
Select t1.DataID, t2.DataID From TestTable t1, TestTable t2
Where t1.DataID <> t2.DataID
and t1.Data = t2.Data
--using replace to add an escape character.
Select t1.DataID, t2.DataID From TestTable t1, TestTable t2
Where t1.DataID <> t2.DataID
and t1.Data like REPLACE(t2.Data, '[', '\[') escape '\'