Alternate workaround for lazy regular expressions in Snowflake, since this feature is not available in Snowflake SQL

I am trying to parse the "name" and "address" from a string. I have written a regex pattern that works perfectly (verified on regex101.com) with the 'ungreedy/lazy' regex mode, but not with the greedy mode. Here is my Snowflake query:
select
TRIM(REGEXP_SUBSTR(column1,'(^\\D*)((\\bP[OST]*[ .]*O[FFICE]*[ .]*B[OX]*[ .]*\\d+.*)|(\\d+.*))[,.\\s]+([a-zA-Z]{2})[,.\\s]+(\\d{5}|\\d{5}-\\d{4})$',1,1,'is',1)) as test
from values(TRIM('FIRST SECOND THIRD PO BOX 123 DUMMY XX 12345-6789'));
-- please ignore the latter part of the regex; it fetches the territory code and zip code, which work fine.
The above query returns "FIRST SECOND THIRD PO BOX".
And if I return the 2nd group instead, it returns "123 DUMMY".
What I want:
case 1 - when my string is 'FIRST SECOND THIRD PO BOX 123 DUMMY XX 12345-6789'
output of 1st group: "FIRST SECOND THIRD"
output of 2nd group: "PO BOX 123 DUMMY"
case 2 - WHEN my string is 'FIRST SECOND THIRD FOURTH FIFTH 123 DUMMY XX 12345-6789'
output of 1st group: "FIRST SECOND THIRD FOURTH FIFTH"
output of 2nd group: "123 DUMMY"
Please suggest a workaround in Snowflake, since it doesn't have the lazy feature.
PS. If you want to verify in regex101, paste the code and test string below. You will see the result when you switch to Ungreedy.
(^\D*)((\bP[OST]*[ .]*O[FFICE]*[ .]*B[OX]*[ .]*\d+.*)|(\d+.*))[,.\s]+([a-zA-Z]{2})[,.\s]+(\d{5}|\d{5}-\d{4})$
Test String: FIRST SECOND THIRD PO BOX 123 DUMMY XX 12345-6789
Thanks

Writing a JavaScript UDF is always an option, and then you can use your regex unchanged:
create or replace function parse_address(F STRING)
returns VARIANT
language JAVASCRIPT
immutable
as $$
const regex = /(^\D*)((\bP[OST]*[ .]*O[FFICE]*[ .]*B[OX]*[ .]*\d+.*)|(\d+.*))[,.\s]+([a-zA-Z]{2})[,.\s]+(\d{5}|\d{5}-\d{4})$/gm;
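// group 1 captures the leading non-digit "name" part; group 2 the PO Box / street-number "address" part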
let m = regex.exec(F);
// return NULL if the pattern did not match at all
return m ? [m[1], m[2]] : null;
$$;
Usage:
select parse_address($1)
from values('FIRST SECOND THIRD PO BOX 123 DUMMY XX 12345-6789')
, ('FIRST SECOND THIRD FOURTH FIFTH 123 DUMMY XX 12345-6789')
;
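Since the UDF returns a VARIANT array, the two parts can also be pulled out and cast individually; a minimal sketch (the aliases name and address are just illustrative):
select parse_address($1)[0]::string as name      -- element 0 holds the first capture group
     , parse_address($1)[1]::string as address   -- element 1 holds the second capture group
from values('FIRST SECOND THIRD PO BOX 123 DUMMY XX 12345-6789')
, ('FIRST SECOND THIRD FOURTH FIFTH 123 DUMMY XX 12345-6789')
;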

Related

Search for any of a list of strings inside another string

I need to identify records with valid addresses by comparing the address fields against a list of street-like words.
So the code would look something like:
set street_list = 'STREET', 'ROAD', 'AVENUE', 'DRIVE', 'WAY', 'PLACE' (etc.)
;
create table [new table] as
select *
from [source table]
where [address line 1] (contains any word from STREET_LIST) or
[address line 2] (contains any word from STREET_LIST) or
[address line 3] (contains any word from STREET_LIST)
;
Is this possible?
Using LostReality's regexp suggestion, I got as far as:
select *
from [source table]
where upper([address line 1]) regexp '.* STREET.*|.* ST.*|.* ROAD.*|.* RD.*|.* CLOSE.*|.* LANE.*|.* LA.*|.* AVENUE.*|.* AVE.*|.* DRIVE.*|.* DR.*|.* HOUSE.*|.* WAY.*|.* PLACE.*|.* SQUARE.*|.* WALK.*|.* GROVE.*|.* GREEN.*|.* PARK.*|.* PK.*|.* CRESCENT.*|.* TERRACE.*|.* PARADE.*|.* GARDEN.*|.* GARDENS.*|.* COURT.*|.* COTTAGES.*|.* COTTAGE.*|.* MEWS.*|.* ESTATE.*|.* RISE.*|.* FARM.*'
;
and it seems to work.
But I have two small problems with it:
1) how do I write the regexp on more than one line so it's easier to read?
2) is there any way of putting that regexp into a macro variable because I want to check 5 address lines and I don't want 5 copies of the same expression.
Thanks
Solution for Hive. You can put the regexp pattern in a variable, and you can also use a macro. Here is your template, fixed:
set hivevar:street_list ='STREET|ST|ROAD|RD|CLOSE|LANE|LA|AVENUE|AVE|DRIVE|DR|HOUSE|WAY|PLACE|SQUARE|WALK|GROVE|GREEN|PARK|PK|CRESCENT|TERRACE|PARADE|GARDEN|GARDENS|COURT|COTTAGES|COTTAGE|MEWS|ESTATE|RISE|FARM';
--boolean macro for using in the WHERE
create temporary macro contains_word(s string) (upper(s) rlike ${hivevar:street_list} ) ;
with some_table as ( --use your table instead of this synthetic example
select stack(2,'some string containing STREET and WALK',
'some string containing something else') as str
) --use your table instead of this synthetic example
--use macro in your query
select str from some_table
where contains_word(str);
Result:
OK
some string containing STREET and WALK
Time taken: 0.229 seconds, Fetched: 1 row(s)
Use OR like in your question:
where contains_word(address_line_1) OR contains_word(address_line_2) ...
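For example, applied to the question's template (the table and column names below are placeholders, not real objects):
create table new_table as
select *
from source_table
where contains_word(address_line_1)
   or contains_word(address_line_2)
   or contains_word(address_line_3);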
Hope you have got the idea

SQL: run only one of two statements if an internal condition is met

I don't know if anyone here knows the Tasker Android app, but I think everyone can broadly understand what I'm trying to accomplish, because I will basically talk about "raw" SQL code as it's written in most common languages.
First, this is what I want, roughly:
IF (SELECT * FROM ("january") WHERE ("day") = (19)) MATCHES [%records(#) = 1] END
ELSE
SELECT * FROM ("january") WHERE ("day") = (19) ORDER BY ("timea") DESC END
What I mean above is: if the first part of the code (IF ... END) finds exactly one record matching 19 in the 'day' column, end execution there; but if more than one record is found, jump to the next part, after ELSE.
And if you are a Tasker user, you will understand the next (my current) setup:
A1: SQL Query [ Mode:Raw File:Tasker/Resources/Calendar Express/calendar_db Table:january Columns:day Query:SELECT * FROM ("january") WHERE ("day") = (19) Selection Parameters: Order By: Output Column Divider: Variable Array:%records Use Root:Off ]
A2: SQL Query [ Mode:Raw File:Tasker/Resources/Calendar Express/calendar_db Table:january Columns:day Query:SELECT * FROM ("january") WHERE ("day") = (19) ORDER BY ("timea") DESC Selection Parameters: Order By: Output Column Divider: Variable Array:%records Use Root:Off ] If [ %records(#) > 1 ]
An:...
So, as you can see, A1 always runs, without exception, storing the result in the variable array '%records()' (% is how Tasker identifies variables, like $ in other languages, and it uses parentheses rather than brackets). Then, if the array holds just one entry, A2 is skipped (because of its condition %records(#) > 1) and the following actions are executed.
But if, after running A1, the %records() array contains 3 entries, A2 will be executed, overwriting the content of %records() that was previously set. This time it will hold the same number of records (3), but reordered.
Is it possible to do this in just one line of code? Thanks ;)
As 'sticky bit' replied in a comment, I can just keep using the second action on its own, as the ORDER BY won't affect the output when there is only a single record. Solved!
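In other words, the single query from A2 covers both cases; roughly (table and column names as in the setup above):
SELECT * FROM january WHERE day = 19 ORDER BY timea DESC; -- ordering a one-row result changes nothing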

FoxPro string + variable combination in FOR loop

As in the title, there is an error in the first command of my FOR loop: "Command contains unrecognized phrase". I am wondering if the string+variable method is wrong.
ALTER TABLE table1 ADD COLUMN prod_n c(10)
ALTER TABLE table1 ADD COLUMN prm1 n(19,2)
ALTER TABLE table1 ADD COLUMN rbon1 n(19,2)
ALTER TABLE table1 ADD COLUMN total1 n(19,2)
There are prm2 ... through total5, where the numbers represent the month.
FOR i=1 TO 5
REPLACE ALL prm+i WITH amount FOR LEFT(ALLTRIM(a),1)="P" AND batch_mth = i
REPLACE ALL rbon+i WITH amount FOR LEFT(ALLTRIM(a),1)="R" AND batch_mth = i
REPLACE ALL total+i WITH sum((prm+i)+(rbon+i)) FOR batch_mth = i
NEXT
ENDFOR
Thanks for the help.
There are a number of things wrong with the code you posted above. Cetin has already mentioned some of them, so I apologize if I duplicate a few.
PROBLEM 1 - in your ALTER TABLE commands I do not see where you create the fields prm2, prm3, prm4, prm5, rbon2, rbon3, etc.
And yet your FOR loop would be trying to write to those fields as the loop expression i increases from 1 to 5 - if the other parts of your code were correct.
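For example, the missing columns would have to be added first, following the same pattern as your own ALTER TABLE commands (a sketch only; months 3 through 5 are analogous):
ALTER TABLE table1 ADD COLUMN prm2 n(19,2)
ALTER TABLE table1 ADD COLUMN rbon2 n(19,2)
ALTER TABLE table1 ADD COLUMN total2 n(19,2)
* ...and likewise for prm3/rbon3/total3 up to prm5/rbon5/total5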
PROBLEM 2 - You cannot concatenate a String to an Integer so as to create a Field Name like you attempt to do with prm+i or rbon+i
Cetin's code suggestions would work (again, as long as you had the #2, #3, etc. fields defined). However, in FoxPro and Visual FoxPro you can generally do a task in a variety of ways.
Personally, for readability I'd approach your FOR LOOP like this:
FOR i=1 TO 5
* --- Keep in mind that unless fields #2, #3, #4, & #5 are defined ---
* --- The following will Fail ---
cFld1 = "prm" + STR(i,1) && define the 1st field
cFld2 = "rbon" + STR(i,1) && define the 2nd field
cFld3 = "total" + STR(i,1) && define the 3rd field
REPLACE ALL &cFld1 WITH amount ;
FOR LEFT(ALLTRIM(a),1)="P" AND batch_mth = i
REPLACE ALL &cFld2 WITH amount ;
FOR LEFT(ALLTRIM(a),1)="R" AND batch_mth = i
* --- total = prm + rbon for this month ---
REPLACE ALL &cFld3 WITH &cFld1 + &cFld2 ;
FOR batch_mth = i
NEXT
NOTE - it might be good if you would learn to use VFP's Debug tools so that you can examine your code execution line-by-line in the VFP Development mode. And you can also use it to examine the variable values.
Breakpoints are good, but you have to already have the TRACE WINDOW open for the Break to work.
SET STEP ON is the Debug command that I generally use so that program execution will stop and AUTOMATICALLY open the TRACE WINDOW for looking at code execution and/or variable values.
Do you mean you have fields named prm1, prm2, prm3 ... prm12 that represent the months and you want to update them in a loop? If so, you need to understand that a "fieldName" is a "name" and thus you need to use a "name expression" to use it as a variable. That is:
prm+i
would NOT work but:
( 'prm'+ ltrim(str(m.i)) )
would.
For example here is your code revised:
For i=1 To 5
Replace All ('prm'+Ltrim(Str(m.i))) With amount For Left(Alltrim(a),1)="P" And batch_mth = m.i
Replace All ('rbon'+Ltrim(Str(m.i))) With amount For Left(Alltrim(a),1)="R" And batch_mth = m.i
* ????????? REPLACE ALL ('total'+Ltrim(Str(m.i))) WITH sum((prm+i)+(rbon+i)) FOR batch_mth = i
Endfor
However, I must admit, your code doesn't make sense to me. Maybe it would be better if you explained what you are trying to do and give some simple data with the result you expect (as code - you can use FAQ 50 on foxite to create code for data).

PowerPivot - DAX find text in another cell - cannot get string value

I have some simple data, and I want to return Home or Motor if a FIND search returns True.
For Example
Skills Type
I Home Sr
A Mot Pre
Type is my custom column
Starting with just returning True, it fails right here with
=FIND("Home",[Skills])
With
Calculation error in column 'Table1'[]: The search Text provided to function 'FIND' could not be found in the given text.
Ultimately I want: if FIND finds "Home", return Home; if it finds "Motor", return Motor.
Desired output (please note there are other starting variations of the Skills text, so I cannot use a fixed search position):
Skills Type
I Home Sr Home
A Mot Pre Motor
Use the following expression for Type column:
=
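// SEARCH returns its 4th argument (0) when the text is not found; IF treats any non-zero number as TRUE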
IF (
IFERROR ( SEARCH ( "Mot*", [Skills], 1, 0 ), 0 ),
"Motor",
IF ( IFERROR ( SEARCH ( "Hom*", [Skills], 1, 0 ), 0 ), "Home", "Nothing" )
)
It will generate Home or Motor for any occurrence of Hom* and Mot*; note the * wildcard.
It should produce the desired output shown above. The screenshot showed the table generated in Power BI, but this solution works in PowerPivot too; I just don't have access to PowerPivot at the moment.
Note that if no occurrence is found, it will put "Nothing" in your Type column, so you can replace "Nothing" with whatever string you want to appear in that case.
See the SEARCH function documentation for details.
Let me know if this helps.

Pig Latin - Extracting fields meeting two different filter criteria from chararray line and grouping in a bag

I am new to Pig Latin.
I want to extract all lines that match a filter criterion (contain the word "line_token") from log files, and then from those matching lines extract two different fields meeting two separate field-match criteria. Since the lines aren't structured well, I am loading them as a chararray.
When I try to run the following code, I get the error
"Invalid resource schema: bag schema must have tuple as its field"
I have tried to perform an explicit cast to a tuple, but that does not work.
input_lines = LOAD '/inputdir/' AS ( line:chararray);
filtered_lines = FILTER input_lines BY (line MATCHES '.*line_token1.*' );
tokenized_lines = FOREACH filtered_lines GENERATE FLATTEN(TOKENIZE(line)) AS tok_line;
my_wordbag = FOREACH tokenized_lines {
word1 = FILTER tok_line BY ( $0 MATCHES '.*word_token1.*' ) ;
word2 = FILTER tok_line BY ( $0 MATCHES '.*word_token1.*' ) ;
GENERATE word1 , word2 as my_tuple ;
-- I also tried --> GENERATE (word1 , word2) as my_tuple ;
}
dump my_wordbag;
I suppose I am taking a very wrong approach.
Please note that my logs aren't structured well, so I can't mend the way I load them.
After loading and the initial filtering for lines of interest (which is straightforward), I guess I need to do something different rather than tokenize each line and iterate through the fields trying to find the ones I want.
Or maybe I should use joins?
Also, if I know the structure of the line beforehand (all text fields), would loading it differently (not as a chararray) make this an easier problem?
For now I made a compromise: I added an extra filter clause to my original line filter and settled for picking just one field from the line. When I get back to it I will try joins and post that code ... Here's my working code that gets me useful output, but not all that I want.
-- read input lines from poorly structured log
input_lines = LOAD '/log-in-dir-in-hdfs' AS ( line:chararray) ;
-- Filter for the line filter criteria and the date of interest, passed as an argument
filtered_lines = FILTER input_lines BY (
( line MATCHES '.*line_filter1*' )
AND ( line MATCHES '.*line_filter2.*' )
AND ( line MATCHES '.*$forDate.*' )
) ;
-- Tokenize every line
tok_lines = FOREACH filtered_lines
GENERATE TOKENIZE(line) AS tok_line;
-- Pick up a specific field from the tokenized line based on column filter criteria
fnames = FOREACH tok_lines {
fname = FILTER tok_line BY ( $0 MATCHES '.*field_selection.*' ) ;
GENERATE FLATTEN(fname) as nnfname;
}
-- Count occurrences of that field and store it with the field name
-- My original intent is to store another field name as well
-- I will do that once I figure out how to put both of them in a tuple
flgroup = FOREACH fnames
GENERATE FLATTEN(TOKENIZE((chararray)$0)) as cfname;
grpfnames = group flgroup by cfname;
readcounts = FOREACH grpfnames GENERATE COUNT(flgroup), group ;
STORE readcounts INTO '/out-dir-in-hdfs';
As I understand it, after the FLATTEN operation you have a single line (tok_line) in each row, and you want to extract 2 words from each line. REGEX_EXTRACT will help you achieve this. I'm not a REGEX expert, so I will leave writing the REGEX part up to you.
data = FOREACH tokenized_lines
GENERATE
REGEX_EXTRACT(tok_line, <first word regex goes here>) as firstWord,
REGEX_EXTRACT(tok_line, <second word regex goes here>) as secondWord;
I hope this helps.
You must refer to the alias, not the column.
So:
word1 = FILTER tokenized_lines BY ( $0 MATCHES '.*word_token1.*' ) ;
word1 and word2 are going to be aliases as well, not columns.
What do you need the output to look like?