matching subsequent space separated numbers as different tokens - antlr

In a flat file, for which I'm trying to write a parser, there is a line like this:
//TN PN RO
0 5 3
TN, PN and RO are the parameter names (I have added the line starting with "//" for better understanding; the actual file does not have it).
The ranges for each of these parameters are different.
TN can be 0 or 1, PN 0-7 and RO 0-3.
I understand why the following grammar does not work (0 and 1 are matched by all three lexer rules, and 2 and 3 by both the PN and RO rules), but is there a way to achieve what I'm trying to do here?
grammar PARAM;
parameters: TN PN RO;
TN: [0-1];
PN: [0-7];
RO: [0-3];
WS : [ \r\t\n]+ -> skip ;
I'd like to match these overlapping numbers as different tokens. Otherwise I have to change my grammar to this and then check the ranges manually on the Java side.
grammar PARAM;
parameters: DIGIT DIGIT DIGIT;
DIGIT: [0-7];
WS : [ \r\t\n]+ -> skip ;
Thanks.

Since the lexer does not know the context / the number's position on the line (unless hacked with some custom code), it cannot know whether to match a 0 as TN, PN or RO. The right place to make this distinction is the parser.
You could do this to avoid checking the ranges in Java (although I would personally check them in Java rather than do this):
parameters: tn_param pn_param ro_param;
tn_param: TN_DIGIT;
pn_param: TN_DIGIT | RO_DIGIT | PN_DIGIT;
ro_param: TN_DIGIT | RO_DIGIT;
TN_DIGIT: [0-1];
RO_DIGIT: [2-3];
PN_DIGIT: [4-7];
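If you do go with the simpler DIGIT-only grammar, the manual range check the asker wanted to avoid is straightforward. A minimal Python sketch of that check (the line format and ranges are taken from the question; the function name is made up):

```python
# Valid ranges per parameter, as stated in the question:
# TN can be 0 or 1, PN 0-7 and RO 0-3.
RANGES = {"TN": (0, 1), "PN": (0, 7), "RO": (0, 3)}

def parse_params(line):
    """Parse a 'TN PN RO' line and validate each value against its range."""
    names = ["TN", "PN", "RO"]
    values = [int(tok) for tok in line.split()]
    if len(values) != len(names):
        raise ValueError("expected exactly three numbers")
    for name, value in zip(names, values):
        lo, hi = RANGES[name]
        if not lo <= value <= hi:
            raise ValueError(f"{name}={value} outside range {lo}-{hi}")
    return dict(zip(names, values))
```

For example, `parse_params("0 5 3")` accepts the sample line, while `parse_params("2 5 3")` raises because TN only allows 0 or 1.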

Related

Multiline regex match in MariaDB/Mediawiki

I am trying to match text (contained in a Mediawiki template) spanning multiple lines via the Replace Text extension in MW 1.31, on a server running MariaDB 10.3.22.
An example of the template is the following (other templates may exist on the same page):
{{WoodhouseENELnames
|Text=[[File:woodhouse_999.jpg|thumb|link={{filepath:woodhouse_999.jpg}}]]Αἰακός, ὁ, or say, son of Aegina.
<b class="b2">Of Aeacus</b>, adj.: Αἰάκειος.
<b class="b2">Descendant of Aeacus</b>: Αἰακίδης, -ου, ὁ.
}}
Above and below there could be other templates, with a varying number of line breaks, e.g.:
{{MyTemplatename
|Text=text, text, text
}}
{{WoodhouseENELnames
|Text=text, text, text
}}
{{OtherTemplatename
|Text= text, text, text
}}
There is a varying number of lines and/or line breaks within the template. I want to match the full template and delete it; that is, match from {{WoodhouseENELnames to the closing }}, but without matching any templates further down, i.e. stop matching if a further {{ is encountered.
The closest I got was using something like:
Find
({{WoodhouseENELnames\n\|Text=)(.*?)\n+(.*?)\n+(.*?)\n+(.*?)(\n+}})
And adding/removing (.*?)\n+ in the regex to match cases with more or less lines. The problem is that this expression might inadvertently match other templates following this one.
Is there a regex that would match all possible text/line breaks contained within the template (in a lazy way, as there may be other templates above and below) on the same page? (The templates are delimited by opening {{ and closing }}.)
Edited to clear up any confusion.
This is a recursion simulation, for use on Java- and Python-style engines that do not support function calls (recursion):
(?s)(?={{WoodhouseENELnames)(?:(?=.*?{{(?!.*?\1)(.*}}(?!.*\2).*))(?=.*?}}(?!.*?\2)(.*)).)+?.*?(?=\1)(?:(?!{{).)*(?=\2$)
Recursion Simulation demo
Just check the matches for the result.
This is real recursion, for use on Perl and PCRE-style engines:
(?s){{WoodhouseENELnames((?:(?>(?:(?!{{|}}).)+)|{{(?1)}})*)}}
Recursion demo
Note that .NET does this differently and is not included here.
I can only think of a brute-force, iterative approach using a recursive query.
The idea is to walk through the string, starting at the first occurrence of the string part '{{WoodhouseENELnames'. From there on, we keep a counter that tracks how many opening and closing brackets have been met. When the count reaches 0, we know the pattern is exhausted. The final step is to rebuild a string that retains the parts before and after the pattern.
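The walk described above can also be sketched procedurally. Here is a Python version of the same counting idea (this is a sketch of the algorithm, not the SQL itself; the function name is made up):

```python
def strip_template(text, name="WoodhouseENELnames"):
    """Remove the first {{name ... }} template, counting nested {{ / }} pairs."""
    start = text.find("{{" + name)
    if start < 0:
        return text  # template not present
    depth, i = 0, start
    while i < len(text):
        if text.startswith("{{", i):
            depth += 1
            i += 2
        elif text.startswith("}}", i):
            depth -= 1
            i += 2
            if depth == 0:  # pattern exhausted: splice out the template
                return text[:start] + text[i:]
        else:
            i += 1
    return text  # unbalanced brackets: leave the string unchanged
```

On an input like `'{{abcd{{WoodhouseENELnames ... {{filepath:...}} ... }} efgh{{'` this returns `'{{abcd efgh{{'`, matching the SQL result shown below.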
For this to work, you need a unique column to identify each row. I assumed id.
with recursive cte as (
    select
        n_open n0,
        n_open n1,
        1 cnt,
        mycol,
        id
    from (select t.*, locate('{{WoodhouseENELnames', mycol) n_open from mytable t) x
    where n_open > 0
    union all
    select
        n0,
        n1 + 2 + case when n_open > 0 and n_open < n_close then n_open else n_close end,
        cnt + case when n_open > 0 and n_open < n_close then 1 else -1 end,
        mycol,
        id
    from (
        select
            c.*,
            locate('{{', substring(mycol, n1 + 2)) n_open,
            locate('}}', substring(mycol, n1 + 2)) n_close
        from cte c
    ) x
    where cnt > 0
)
select id, concat(substring(mycol, 1, min(n0) - 1), substring(mycol, max(n1) + 1)) mycol
from cte
group by id
Demo on DB Fiddle
Set-up - I added string parts before and after the pattern (including double brackets for extra fun):
create table mytable(id int, mycol varchar(2000));
insert into mytable values (
1,
'{{abcd{{WoodhouseENELnames
|Text=[[File:woodhouse_999.jpg|thumb|link={{filepath:woodhouse_999.jpg}}]]Αἰακός, ὁ, or say, son of Aegina.
<b class="b2">Of Aeacus</b>, adj.: Αἰάκειος.
<b class="b2">Descendant of Aeacus</b>: Αἰακίδης, -ου, ὁ.
}} efgh{{'
);
Results:
id | mycol
-: | :------------
1 | {{abcd efgh{{
MariaDB uses the PCRE-Regex engine.
If you can assure that
the opening tag of your template ({{WoodhouseENELnames) starts on a new line,
the closing tag of your template (}}) starts on a new line, and
no other closing tag (}}) in between starts on a new line, the following regex will do:
(?ms)^{{WoodhouseENELnames.+?^}}
Description:
(?ms) tells the regex engine that ^ matches after any line break in the text and that . also matches newlines.
Then search for your opening tag at the start of a line.
Search for the shortest possible string, including any character (also newlines), up to
a closing tag (}}) at the start of a line.
If you want to capture the match, enclose the regex within ( and ).
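Under those line-start constraints, the pattern behaves the same in any engine that supports the (?ms) flags. A quick Python check against the question's sample layout (not MariaDB, but the same regex semantics):

```python
import re

# Three templates stacked on one page, as in the question.
page = (
    "{{MyTemplatename\n|Text=text, text, text\n}}\n"
    "{{WoodhouseENELnames\n|Text=text, text, text\n}}\n"
    "{{OtherTemplatename\n|Text= text, text, text\n}}"
)

# (?m): ^ matches at line starts; (?s): . also matches newlines.
match = re.search(r"(?ms)^\{\{WoodhouseENELnames.+?^\}\}", page)
```

The lazy `.+?` stops at the first line-initial `}}`, so `match.group(0)` is exactly the middle template and the surrounding templates are untouched.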
EDIT:
As PCRE2 supports recursive patterns, the following, more complex regex will match regardless of the beginning-of-line constraints above:
(?msx)
({{WoodhouseENELnames # group 1: Matches the whole template
( # group 2: Matches the contents of the template, including subpatterns.
[^{}]* # Zero or more characters except { or }
{{ # The beginning of a subpattern
( # Containing either:
[^{}]++ # one or more characters except { or } (possessive)
| (?2) # or the recursive pattern of group 2
)* # Zero or more times
}} # The closing of the subpattern.
[^{}]* # Zero or more characters except { or }
)
}}
)
Caveat: doesn't cater for a single { or } within the templates.
EDIT 2
I hate giving up before the job is done :-) This regex should work regardless of all the constraints above:
(?msx) # Note the additional 'x' option, allowing free spacing.
({{WoodhouseENELnames # Search group 1 - top-level template:
( # Search group 2 - top-level template contents:
( # Search group 3 - subtemplate contents:
[^{}]* # Zero or more characters except { or }
| {(?!{) # or a single { not followed by a {
| }(?!}) # or a single } not followed by a }
)* # Closing search group 3
{{ # Opening subtemplate tag
( # Search group 4:
(?3)* # Reusing search group 3, zero or more times
| (?2) # or recursing search group 2 (of which this is a part)
)* # Group 4 zero or more times
}} # Closing subtemplate tag
(?3)* # Reusing search group 3, zero or more times
) # Closing search group 2 - template contents
}} # Top-level template closing tag
) # Closing search group 1
The last two solutions are based on the PCRE2 documentation

SQL: run only one of two stataments if an internal condition is met

I don't know if anyone here knows the Tasker Android app, but I think everyone can broadly understand what I'm trying to accomplish, because I will basically talk about "raw" SQL code as it's written in most common languages.
First, this is what I want, roughly:
IF (SELECT * FROM ("january") WHERE ("day") = (19)) MATCHES [%records(#) = 1] END
ELSE
SELECT * FROM ("january") WHERE ("day") = (19) ORDER BY ("timea") DESC END
What I want to say above is: if in the first part of the code (IF ... END) the number of resulting records matching the number 19 in the 'day' column is just one, end execution there; but if more than one record is found, jump to the next part, after ELSE.
And if you are a Tasker user, you will understand the next (my current) setup:
A1: SQL Query [ Mode:Raw File:Tasker/Resources/Calendar Express/calendar_db Table:january Columns:day Query:SELECT * FROM ("january") WHERE ("day") = (19) Selection Parameters: Order By: Output Column Divider: Variable Array:%records Use Root:Off ]
A2: SQL Query [ Mode:Raw File:Tasker/Resources/Calendar Express/calendar_db Table:january Columns:day Query:SELECT * FROM ("january") WHERE ("day") = (19) ORDER BY ("timea") DESC Selection Parameters: Order By: Output Column Divider: Variable Array:%records Use Root:Off ] If [ %records(#) > 1 ]
An:...
So, as you can see, A1 will always run, without exceptions, putting the result in the variable array '%records()' (% is how Tasker identifies variables, like $ in other languages; it also uses parentheses rather than brackets). Then, if the number of entries in the array is just one, A2 will be skipped (its condition is %records(#) > 1), and the following actions are executed.
But if, after running A1, the %records() array contains 3 entries, action A2 will be executed, overwriting the previously set content of the %records() array. This time it will contain the same number of records (3), but reordered.
Is it possible to do this in just one line of code? Thanks ;)
As 'sticky bit' replied in a comment, I can just keep using the second action on its own, as the ordering won't affect the output when there's only a single record. Solved!
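The workaround works because ORDER BY on a one-row result changes nothing. A small sqlite3 sketch of that reasoning (table and column names borrowed from the question; the data is made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE january (day INTEGER, timea TEXT)")
conn.executemany(
    "INSERT INTO january (day, timea) VALUES (?, ?)",
    [(19, "09:00"), (19, "18:30"), (19, "12:15"), (20, "10:00")],
)

# Always run the ordered query: with one matching row the ORDER BY is a
# no-op; with several, it returns the same rows, just reordered.
records = conn.execute(
    "SELECT * FROM january WHERE day = 19 ORDER BY timea DESC"
).fetchall()
```

Here `records` holds the three day-19 rows in descending time order; had only one row matched, the same single query would have returned it unchanged.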

Force FsCheck to generate NonEmptyString for discriminating union fields of type string

I'm trying to achieve the following behaviour with FsCheck: I'd like to create a generator that will generate an instance of the MyUnion type, with every string field being non-null/non-empty.
type MyNestedUnion =
    | X of string
    | Y of int * string

type MyUnion =
    | A of int * int * string * string
    | B of MyNestedUnion
My 'real' type is much larger/deeper than MyUnion, and FsCheck is able to generate an instance without any problem, but the string fields of the union cases are sometimes empty. (For example it might generate B (Y (123, "")).)
Perhaps there's some obvious way of combining FsCheck's NonEmptyString and its support for generating arbitrary union types that I'm missing?
Any tips/pointers in the right direction greatly appreciated.
Thanks!
This goes against the grain of property based testing (in that you explicitly prevent valid test cases from being generated), but you could wire up the non-empty string generator to be used for all strings:
type Alt =
    static member NonEmptyString () : Arbitrary<string> =
        Arb.Default.NonEmptyString()
        |> Arb.convert
            (fun (nes : NonEmptyString) -> nes.Get)
            NonEmptyString.NonEmptyString
Arb.register<Alt>()
let g = Arb.generate<MyUnion>
Gen.sample 1 10 g
Note that you'd need to re-register the default generator after the test, since the mappings are global.
A more by-the-book solution would be to use the default derived generator and then filter out values that contain invalid strings (i.e. use ==>), but you might find that infeasible for particularly deeply nested types.

Pig Latin - Extracting fields meeting two different filter criteria from chararray line and grouping in a bag

I am new to Pig Latin.
I want to extract all lines that match a filter criterion (contain the word "line_token") from log files, and then from these matching lines extract two different fields meeting two separate field-match criteria. Since the lines aren't well structured, I am loading them as a chararray.
When I try to run the following code, I get the error
"Invalid resource schema: bag schema must have tuple as its field"
I have tried to perform an explicit cast to a tuple, but that does not work.
input_lines = LOAD '/inputdir/' AS ( line:chararray);
filtered_lines = FILTER input_lines BY (line MATCHES '.*line_token1.*' );
tokenized_lines = FOREACH filtered_lines GENERATE FLATTEN(TOKENIZE(line)) AS tok_line;
my_wordbag = FOREACH tokenized_lines {
word1 = FILTER tok_line BY ( $0 MATCHES '.*word_token1.*' ) ;
word2 = FILTER tok_line BY ( $0 MATCHES '.*word_token2.*' ) ;
GENERATE word1 , word2 as my_tuple ;
-- I also tried --> GENERATE (word1 , word2) as my_tuple ;
}
dump my_wordbag;
I suppose I am taking a very wrong approach.
Please note - my logs aren't structured well - so I can't change the way I load them.
Post loading and initial filtering for the lines of interest (which is straightforward), I guess I need to do something other than tokenizing the line and iterating through fields trying to find the ones I want.
Or maybe I should use joins?
Also, if I know the structure of the line beforehand, with all text fields, will loading it differently (not as a chararray) make this an easier problem?
For now I made a compromise - I added an extra filter clause to my original line filter and settled for picking just one field from the line. When I get back to it I will try with joins and post that code. Here's my working code that gets me a useful output - but not all that I want.
-- read input lines from poorly structured log
input_lines = LOAD '/log-in-dir-in-hdfs' AS ( line:chararray) ;
-- Filter for line filter criteria and date interested in passed as arg
filtered_lines = FILTER input_lines BY (
( line MATCHES '.*line_filter1.*' )
AND ( line MATCHES '.*line_filter2.*' )
AND ( line MATCHES '.*$forDate.*' )
) ;
-- Tokenize every line
tok_lines = FOREACH filtered_lines
GENERATE TOKENIZE(line) AS tok_line;
-- Pick a specific field from the tokenized line based on column filter criteria
fnames = FOREACH tok_lines {
fname = FILTER tok_line BY ( $0 MATCHES '.*field_selection.*' ) ;
GENERATE FLATTEN(fname) as nnfname;
}
-- Count occurrences of that field and store the count with the field name
-- My original intent is to store another field name as well
-- I will do that once I figure out how to put both of them in a tuple
flgroup = FOREACH fnames
GENERATE FLATTEN(TOKENIZE((chararray)$0)) as cfname;
grpfnames = group flgroup by cfname;
readcounts = FOREACH grpfnames GENERATE COUNT(flgroup), group ;
STORE readcounts INTO '/out-dir-in-hdfs';
As I understand it, after the FLATTEN operation you have a single line (tok_line) in each row, and you want to extract 2 words from each line. REGEX_EXTRACT will help you achieve this. I'm not a regex expert, so I'll leave writing the regex part up to you.
data = FOREACH tokenized_lines
GENERATE
REGEX_EXTRACT(tok_line, <first word regex goes here>) as firstWord,
REGEX_EXTRACT(tok_line, <second word regex goes here>) as secondWord;
I hope this helps.
You must refer to the alias, not the column.
So:
word1 = FILTER tokenized_lines BY ( $0 MATCHES '.*word_token1.*' ) ;
word1 and word2 are going to be aliases as well, not columns.
What do you need the output to look like?

REGEXP_SUBSTR : extracting portion of string between [ ] including []

I am on Oracle 11gR2.
I am trying to extract the text between '[' and ']' including [].
ex:
select regexp_substr('select userid,username from tablename where user_id=[REQ.UID] and username=[REQD.VP.UNAME]','\[(.*)\]') from dual
Output:
[REQ.UID] and username=[REQD.VP.UNAME]
Output needed:
[REQ.UID][REQD.VP.UNAME]
Please let me know how to get the needed output.
Thanks & Regards,
Bishal
Assuming you are only going to have two occurrences of [], the following should suffice. The ? in .*? makes it non-greedy, so that it doesn't gobble up the last ].
select
regexp_replace('select userid,username from tablename where user_id=[REQ.UID] and username=[REQD.VP.UNAME]'
,'.*(\[.*?\]).*(\[.*?\]).*','\1\2')
from dual
;
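Lazy quantifiers behave the same way in Python's re engine, so this pattern can be sanity-checked outside Oracle. A sketch (Python, not Oracle syntax; same pattern and replacement as the REGEXP_REPLACE above):

```python
import re

sql = ("select userid,username from tablename "
       "where user_id=[REQ.UID] and username=[REQD.VP.UNAME]")

# Same idea as the REGEXP_REPLACE: two non-greedy bracketed groups captured,
# everything else swallowed by the greedy .* parts and dropped.
result = re.sub(r".*(\[.*?\]).*(\[.*?\]).*", r"\1\2", sql)
```

`result` is the concatenation of the two bracketed groups, which is the output the asker wanted.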
I'm not an Oracle user, but from a quick perusal of the docs, I think this should be close:
REGEXP_REPLACE('select userid,username from tablename where user_id=[REQ.UID] and username=[REQD.VP.UNAME]',
'^[^\[]*(\[[^\]]*\])[^\[]*(\[[^\]]*\])$', '\1 \2')
Which looks much nastier than it is.
The pattern is:
^[^\[]* Match all characters up to (but not including) the first [
(\[[^\]]*\]) Capture into group 1 anything like [<not "]">]
[^\[]* Match everything up to (but not including) the next [
(\[[^\]]*\]) Capture into group 2 anything like [<not "]">], at the end of the string
Then the replacement is simple: just <grp 1> <grp 2>.