I am trying to match text (contained in a Mediawiki template) in multiple lines via the Replace Text extension in MW 1.31, server running MariaDB 10.3.22.
An example of the template is the following (other templates may exist on the same page):
{{WoodhouseENELnames
|Text=[[File:woodhouse_999.jpg|thumb|link={{filepath:woodhouse_999.jpg}}]]Αἰακός, ὁ, or say, son of Aegina.
<b class="b2">Of Aeacus</b>, adj.: Αἰάκειος.
<b class="b2">Descendant of Aeacus</b>: Αἰακίδης, -ου, ὁ.
}}
Above and below it there could be other templates, separated by a varying number of line breaks, e.g.:
{{MyTemplatename
|Text=text, text, text
}}
{{WoodhouseENELnames
|Text=text, text, text
}}
{{OtherTemplatename
|Text= text, text, text
}}
There is a varying number of lines and/or line breaks within the template. I want to match the full template and delete it; that is, match from {{WoodhouseENELnames to its closing }} without swallowing any templates further down (i.e. stop matching if another {{ is encountered).
The closest I got was using something like:
Find
({{WoodhouseENELnames\n\|Text=)(.*?)\n+(.*?)\n+(.*?)\n+(.*?)(\n+}})
And adding/removing (.*?)\n+ in the regex to match cases with more or less lines. The problem is that this expression might inadvertently match other templates following this one.
Is there a regex that would match all possible text/line breaks contained within the template (in a lazy way, as there may be other templates above and below on the same page)? The templates are delimited by an opening {{ and a closing }}.
Edited to clear up any confusion.
This is a recursion simulation for use on Java/Python-style engines that do not support pattern recursion:
(?s)(?={{WoodhouseENELnames)(?:(?=.*?{{(?!.*?\1)(.*}}(?!.*\2).*))(?=.*?}}(?!.*?\2)(.*)).)+?.*?(?=\1)(?:(?!{{).)*(?=\2$)
Recursion Simulation demo
Just check the matches for the result.
This is real recursion, for use on Perl/PCRE-style engines:
(?s){{WoodhouseENELnames((?:(?>(?:(?!{{|}}).)+)|{{(?1)}})*)}}
Recursion demo
Note that .NET handles this differently and is not included here.
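Since MariaDB's regex support is PCRE-based, the recursive pattern can in principle be applied straight from SQL. A minimal sketch, assuming a hypothetical table page with a text column page_text; note the added (?s) so that . also matches the newlines inside the template:

UPDATE page
SET page_text = REGEXP_REPLACE(
    page_text,
    '(?s){{WoodhouseENELnames((?:(?>(?:(?!{{|}}).)+)|{{(?1)}})*)}}',
    '')
WHERE page_text LIKE '%{{WoodhouseENELnames%';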
I can only think of a brute-force, iterative approach using a recursive query.
The idea is to walk through the string, starting at the first occurrence of the string part '{{WoodhouseENELnames'. From there on, we can set a counter that keeps track of how many opening and closing brackets were met. When the count reaches 0, we know the pattern is exhausted. The final step is to rebuild a string that retains the parts before and after the pattern.
For this to work, you need a unique column to identify each row. I assumed id.
with recursive cte as (
select
n_open n0,
n_open n1,
1 cnt,
mycol,
id
from (select t.*, locate('{{WoodhouseENELnames', mycol) n_open from mytable t) x
where n_open > 0
union all
select
n0,
n1 + 2 + case when n_open > 0 and n_open < n_close then n_open else n_close end,
cnt + case when n_open > 0 and n_open < n_close then 1 else -1 end,
mycol,
id
from (
select
c.*,
locate('{{', substring(mycol, n1 + 2)) n_open,
locate('}}', substring(mycol, n1 + 2)) n_close
from cte c
) x
where cnt > 0
)
select id, concat(substring(mycol, 1, min(n0) - 1), substring(mycol, max(n1) + 1)) mycol
from cte
group by id
Demo on DB Fiddle
Set-up - I added string parts before and after the pattern (including double brackets for extra fun):
create table mytable(id int, mycol varchar(2000));
insert into mytable values (
1,
'{{abcd{{WoodhouseENELnames
|Text=[[File:woodhouse_999.jpg|thumb|link={{filepath:woodhouse_999.jpg}}]]Αἰακός, ὁ, or say, son of Aegina.
<b class="b2">Of Aeacus</b>, adj.: Αἰάκειος.
<b class="b2">Descendant of Aeacus</b>: Αἰακίδης, -ου, ὁ.
}} efgh{{'
);
Results:
id | mycol
-: | :------------
1 | {{abcd efgh{{
MariaDB uses the PCRE-Regex engine.
If you can ensure that
the opening tag of your template ({{WoodhouseENELnames) starts on a new line
the closing tag of your template (}}) starts on a new line
no other closing tag (}}) in between starts on a new line, then the following regex will do:
(?ms)^{{WoodhouseENELnames.+?^}}
Description:
(?ms) tells the regex engine that ^ matches at the start of every line (m) and that . also matches newlines (s).
Then search for your opening tag at the start of a line.
Search for the shortest possible string including any character (also newlines) up to
a closing tag (}}) on a new line.
If you want to capture the match, enclose the regex in ( and ).
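Applied in MariaDB (whose regex functions are PCRE-based), this could look like the following sketch; the table and column names here are hypothetical:

SELECT REGEXP_REPLACE(page_text, '(?ms)^{{WoodhouseENELnames.+?^}}', '') AS cleaned_text
FROM page;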
EDIT:
As PCRE2 supports recursive patterns, the following, more complex regex will match regardless of the beginning-of-line constraints above:
(?msx)
({{WoodhouseENELnames # group 1: Matching the whole template
( # group 2: Matching the contents of the template, including subpatterns.
[^{}]* # Match zero or more characters except { or }
{{ # The beginning of a subpattern
( # Containing either:
[^{}]++ # one or more characters except { or } (possessive)
| (?2) # or the recursive pattern, group 2
)* # Zero or more times
}} # The closing of the subpattern.
[^{}]* # Match zero or more characters except { or }
)
}}
)
Caveat: doesn't cater for single { or } within the templates.
EDIT 2
I hate giving up before the job is done :-) This regex should work regardless of all the constraints above:
(?msx) # Note the additional 'x'-Option, allowing free spacing.
({{WoodhouseENELnames # Search group 1 - top-level template:
( # Search group 2 - top-level template contents:
( # Search group 3 - subtemplate contents:
[^{}]* # Zero or more characters except { or }
| {(?!{) # or a single { not followed by a {
| }(?!}) # or a single } not followed by a }
)* # Closing search group 3
{{ # Opening subtemplate tag
( # Search group 4:
(?3)* # Reusing search group 3, zero or more times
| (?2) # or Recurse search group 2 (of which, this is a part)
)* # Group 4 zero or more times
}} # Closing subtemplate tag
(?3)* # Reusing search group 3, zero or more times
) # Closing Search group 2 - Template contents
}} # Top-level Template closing tag
) # Closing Search group 1
The last two solutions are based on the PCRE2 documentation
Trying to pass a star (*) in a SQL HANA placeholder with arrow notation
The following works OK:
Select * FROM "table_1"
( PLACEHOLDER."$$IP_ShipmentStartDate$$" => '2020-01-01',
PLACEHOLDER."$$IP_ShipmentEndDate$$" => '2030-01-01' )
In the following, when trying to pass a *, I get a syntax error:
Select * FROM "table1"
( PLACEHOLDER."$$IP_ShipmentStartDate$$" => '2020-01-01',
PLACEHOLDER.'$$IP_ItemTypecd$$' => '''*''',
PLACEHOLDER."$$IP_ShipmentEndDate$$" => '2030-01-01' )
The reason I am using the arrow notation is that it's the only way I know that allows passing parameters as in the example below (as in the linked post):
do begin
declare lv_param nvarchar(100);
select max('some_date')
into lv_param
from dummy /*your_table*/;
select * from "_SYS_BIC"."path.to.your.view/CV_TEST" (
PLACEHOLDER."$$P_DUMMY$$" => :lv_param
);
end;
There's a typo in your code. You need to use double quotes around the parameter name, but you have single quotes. It should be: PLACEHOLDER."$$IP_ItemTypecd$$".
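For illustration, the corrected call from the question would then be (the star is passed as a plain string, as explained below):

Select * FROM "table1"
( PLACEHOLDER."$$IP_ShipmentStartDate$$" => '2020-01-01',
PLACEHOLDER."$$IP_ItemTypecd$$" => '*',
PLACEHOLDER."$$IP_ShipmentEndDate$$" => '2030-01-01' )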
When you pass something to a Calculation View's parameter, you already have a string; it will be treated as a string and quoted where needed, so there is no need to add more quotes. But if you really do need to pass quotes inside the placeholder's value, you also have to escape them with a backslash in addition to doubling them. (This was found by doing a data preview on the calculation view and entering '*' as the value of the input parameter; the log of the preview then shows the valid SQL statement.)
do
begin
select *
from "_SYS_BIC"."ztest/CV_TEST_PERF"(
PLACEHOLDER."$$P_DUMMY$$" => '''*'''
);
end;
/*
SAP DBTech JDBC: [339]: invalid number: : line 3 col 3 (at pos 13): invalid number:
not a valid number string '' at function __typecast__()
*/
/*And in trace there's no more information, but interesting part
is preparation step, not an execution
w SQLScriptExecuto se_eapi_proxy.cc(00145) : Error <exception 71000339:
not a valid number string '' at function __typecast__()
> in preparation of internal statement:
*/
do
begin
select *
from "_SYS_BIC"."ztest/CV_TEST_PERF"(
PLACEHOLDER."$$P_DUMMY$$" => '\'*\''
);
end;
/*
SAP DBTech JDBC: [257]: sql syntax error: incorrect syntax near "\": line 5 col 38 (at pos 121)
*/
But this is ok:
do
begin
select *
from "_SYS_BIC"."ztest/CV_TEST_PERF"(
PLACEHOLDER."$$P_DUMMY$$" => '\''*\'''
);
end;
LOG_ID | DATUM | INPUT_PARAM | CUR_DATE
--------------------------+----------+-------------+---------
8IPYSJ23JLVZATTQYYBUYMZ9V | 20201224 | '*' | 20201224
3APKAAC9OGGM2T78TO3WUUBYR | 20201224 | '*' | 20201224
F0QVK7BVUU5IQJRI2Q9QLY0WJ | 20201224 | '*' | 20201224
CW8ISV4YIAS8CEIY8SNMYMSYB | 20201224 | '*' | 20201224
What about the star itself:
As @LarsBr already said, in SQL you need to use LIKE '%pattern%' to search for strings that contain a pattern in the middle; % is the equivalent of ABAP's * (as far as I know, * is the more common placeholder outside the SQL world). So there's no out-of-the-box conversion of FIELD = '*' to FIELD LIKE '%' or anything similar.
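For illustration, what ABAP would express as FIELD = 'CW*' becomes, in SQL (the table and column names are hypothetical):

select * from mytab where "FIELD" like 'CW%';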
But there's no LIKE predicate in Column Engine (in filter or in calculated column).
If you really need LIKE functionality in filter or calculated column, you can:
Switch execution engine to SQL
Or use the match(arg, pattern) function of the Column Engine, which has now disappeared from the palette and is hidden quite well in the documentation (here, at the very end of the page, after digging into the description field of the last row in the table, you'll find the actual syntax for it. Damn!).
But here you'll meet another surprise: since the Column Engine has different operators than SQL (it is more internal and closer to the DB core), it uses the star (*) as its wildcard character. So for match(string, pattern) you need to use a star again: match('pat string tern', 'pat*tern').
After all of the above: there are cases where you really do want to search for data with wildcards and pass them as a parameter. But then you need to use match and pass the parameter as plain text, with no tricks around the star (*) or anything else (if you want to use officially supported features rather than exploiting some internals).
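The exact filter used below is not shown, but a plausible reconstruction, reusing the column and parameter names from the demo output, would be a match-based filter expression like:

match("LOG_ID", '$$P_DUMMY$$')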
After adding this filter to the RSPCLOGCHAIN projection node of my CV from the previous thread, it works this way:
do
begin
select *
from "_SYS_BIC"."ztest/CV_TEST_PERF"(
PLACEHOLDER."$$P_DUMMY$$" => 'CW*'
);
end;
LOG_ID | DATUM | INPUT_PARAM | CUR_DATE
--------------------------+----------+-------------+---------
CW8ISV4YIAS8CEIY8SNMYMSYB | 20201224 | CW* | 20201224
do
begin
select *
from "_SYS_BIC"."ztest/CV_TEST_PERF"(
PLACEHOLDER."$$P_DUMMY$$" => 'CW'
);
end;
/*
Fetched 0 row(s) in 0 ms 0 µs (server processing time: 0 ms 0 µs)
*/
The notation with triple quotation marks '''*''' is likely what yields the syntax error here.
Instead, use single quotation marks to provide the '*' string.
But that is just half of the challenge here.
In SQL, the placeholder search is done via LIKE and the placeholder character is %, not *.
To mimic the ABAP behaviour when using calculation views, the input parameters must be used in filter expressions in the calculation view. These filter expressions have to check whether the input parameter value is * or not: if it is *, the filter condition needs to be a LIKE; otherwise, an = (equals) condition.
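As a sketch, such a filter condition could look like this; the column name "ITEM_TYPE_CD" is hypothetical:

( '$$IP_ItemTypecd$$' = '*' AND "ITEM_TYPE_CD" LIKE '%' )
OR ( '$$IP_ItemTypecd$$' <> '*' AND "ITEM_TYPE_CD" = '$$IP_ItemTypecd$$' )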
A final comment: the PLACEHOLDER-syntax really only works with calculation views and not with tables.
My project is a Latin language learning app. My DB has all the words I'm teaching, in the table 'words'. It has the lemma (the main form of the word), along with the definition and other information the user needs to learn.
I show one word at a time for them to guess/remember what it means. The correct word is shown along with some wrong words, like:
What does Romanus mean? Greek - /Roman/ - Phoenician - barbarian
What does domus mean? /house/ - horse - wall - senator
The wrong options are randomly drawn from the same table, and must be from the same part of speech (adjective, noun...) as the correct word; but I am only interested in their lemma. My return value looks like this (some properties omitted):
[
{ lemma: 'Romanus', definition: 'Roman', options: ['Greek', 'Phoenician', 'barbarian'] },
{ lemma: 'domus', definition: 'house', options: ['horse', 'wall', 'senator'] }
]
What I am looking for is a more efficient way of doing it than my current approach, which runs a new query for each word:
// All the necessary requires are here
class Word extends Model {
static async fetch() {
const words = await this.findAll({
limit: 10,
order: [Sequelize.literal('RANDOM()')],
attributes: ['lemma', 'definition'], // also a few other columns I need
});
const wordsWithOptions = await Promise.all(words.map(this.addOptions.bind(this)));
return wordsWithOptions;
}
static async addOptions(word) {
const options = await this.findAll({
order: [Sequelize.literal('RANDOM()')],
limit: 3,
attributes: ['lemma'],
where: {
partOfSpeech: word.dataValues.partOfSpeech,
lemma: { [Op.not]: word.dataValues.lemma },
},
});
return { ...word.dataValues, options: options.map((row) => row.dataValues.lemma) };
}
}
So, is there a way I can do this with raw SQL? How about Sequelize? One thing that still helps me is to give a name to what I'm trying to do, so that I can Google it.
EDIT: I have tried the following and at least got somewhere:
const words = await this.findAll({
limit: 10,
order: [Sequelize.literal('RANDOM()')],
attributes: {
include: [[sequelize.literal(`(
SELECT lemma FROM words AS options
WHERE "partOfSpeech" = "options"."partOfSpeech"
ORDER BY RANDOM() LIMIT 1
)`), 'options']],
},
});
Now, there are two problems with this. First, I only get one option, when I need three; but if the query has LIMIT 3, I get: SequelizeDatabaseError: more than one row returned by a subquery used as an expression.
The second error is that while the code above does return something, it always gives the same word as an option! I thought to remedy that with WHERE "partOfSpeech" = "options"."partOfSpeech", but then I get SequelizeDatabaseError: invalid reference to FROM-clause entry for table "words".
So, how do I tell PostgreSQL "for each row in the result, add a column with an array of three lemmas, WHERE existingRow.partOfSpeech = wordToGoInTheArray.partOfSpeech?"
Revised
Well that seems like a different question and perhaps should be posted that way, but...
The main technique remains the same: JOIN instead of sub-select. The difference is generating the list of lemmas and then piping them into the initial query. In a single statement this can get nasty.
As a single statement (actually this turned out not to be too bad):
select w.lemma, w.definition, string_to_array(string_agg(o.definition,','), ',') as options
from words w
join lateral
(select definition
from words o
where o.part_of_speech = w.part_of_speech
and o.lemma != w.lemma
order by random()
limit 3
) o on 1=1
where w.lemma in( select lemma
from words
order by random()
limit 4 --<<< replace with parameter
)
group by w.lemma, w.definition;
The other approach builds a small SQL function to randomly select a specified number of lemmas. This selection is then piped into the (renamed) function from the previous fiddle.
create or replace
function exam_lemma_definition_options(lemma_array_in text[])
returns table (lemma text
,definition text
,option text[]
)
language sql strict
as $$
select w.lemma, w.definition, string_to_array(string_agg(o.definition,','), ',') as options
from words w
join lateral
(select definition
from words o
where o.part_of_speech = w.part_of_speech
and o.lemma != w.lemma
order by random()
limit 3
) o on 1=1
where w.lemma = any(lemma_array_in)
group by w.lemma, w.definition;
$$;
create or replace
function exam_lemmas(num_of_lemmas integer)
returns text[]
language sql
strict
as $$
select string_to_array(string_agg(lemma,','),',')
from (select lemma
from words
order by random()
limit num_of_lemmas
) ll
$$;
Using this approach, your calling code reduces to a single SQL statement:
select *
from exam_lemma_definition_options(exam_lemmas(4))
order by lemma;
This permits you to specify the number of lemmas to select (in this case 4), limited only by the number of rows in the words table. See the revised fiddle.
Original
Instead of using a sub-select to get the option words, just JOIN.
select w.lemma, w.definition, string_to_array(string_agg(o.definition,','), ',') as options
from words w
join lateral
(select definition
from words o
where o.part_of_speech = w.part_of_speech
and o.lemma != w.lemma
order by random()
limit 3
) o on 1=1
where w.lemma = any(array['Romanus', 'domus'])
group by w.lemma, w.definition;
See the fiddle. Obviously this will not necessarily produce the same options as your question provides, due to the random() selection, but it will get matching parts of speech. I will leave the translation to your source language to you; or you can use the function option and reduce your SQL to a simple "select *".
I'm trying to find the most efficient way to remove overlapping substrings from a string field value on BigQuery. My use case is the same as Combining multiple regex substitutions but within BigQuery.
If I sum up the post above:
With the following list of substrings: ["quick brown fox", "fox jumps"]
I want:
A quick brown fox jumps over the lazy dog to be replaced by A over the lazy dog.
My thought was to come up with a JS UDF that does a similar job to what's mentioned in the post above, i.e. create a mask of the whole string and loop over the substrings to identify which characters to remove... But do you have better ideas?
Thanks for your help
I couldn't find out how to do this in Standard SQL
Below is for BigQuery Standard SQL and does the whole thing in one shot - just one [simple] query
#standardSQL
WITH `project.dataset.table` AS (
SELECT 'A quick brown fox jumps over the lazy dog' text
), list AS (
SELECT ['quick brown fox', 'fox jumps'] phrases
)
SELECT text AS original_text, REGEXP_REPLACE(text, STRING_AGG(pattern, '|'), '') processed_text FROM (
SELECT DISTINCT text, SUBSTR(text, MIN(start), MAX(finish) - MIN(start) + 1) pattern FROM (
SELECT *, COUNTIF(flag) OVER(PARTITION BY text ORDER BY start) grp FROM (
SELECT *, start > LAG(finish) OVER(PARTITION BY text ORDER BY start) flag FROM (
SELECT *, start + phrase_len - 1 AS finish FROM (
SELECT *, LENGTH(cut) + 1 + OFFSET * phrase_len + IFNULL(SUM(LENGTH(cut)) OVER(win), 0) start
FROM `project.dataset.table`, list,
UNNEST(phrases) phrase,
UNNEST([LENGTH(phrase)]) phrase_len,
UNNEST(REGEXP_EXTRACT_ALL(text, r'(.+?)' || phrase)) cut WITH OFFSET
WINDOW win AS (PARTITION BY text, phrase ORDER BY OFFSET ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)
)))) GROUP BY text, grp
) GROUP BY text
with output
Row | original_text | processed_text
--: | :---------------------------------------- | :------------------
1 | A quick brown fox jumps over the lazy dog | A over the lazy dog
I tested the above with a few more complex / tricky texts and it still worked
Brief explanation:
gather all occurrences of the phrases in the list and their respective starts and ends
combine overlapping fragments and calculate their respective starts and ends
extract new fragments based on the starts and ends from step 2 above
order them by length DESC and generate a regexp expression
finally do REGEXP_REPLACE using the regexp generated in step 4 above
The above might look messy - but in reality it does all of this in one query and in pure SQL
Using a custom JS UDF seems to work, but I've seen BigQuery run faster..!
CREATE FUNCTION `myproject.mydataset.keyword_remover_js`(label STRING) RETURNS STRING LANGUAGE js AS """
var keywords = ["a quick brown fox", "fox jumps"] ;
var mask = new Array(label.length).fill(1);
var reg = new RegExp("(" + keywords.join("|") + ")", 'g');
var found;
while (found = reg.exec(label)) {
for (var i = found.index; i < reg.lastIndex; i++) {
mask[i] = 0;
}
reg.lastIndex = found.index+1;
}
var result = []
for (var i = 0; i < label.length; i++) {
if (mask[i]) {
result.push(label[i])
}
}
return result.join('').replace(/ +/g,' ').replace(/^ +| +$/g,'')
""";
I don't know if anyone here knows the Tasker Android app, but I think everyone should be able to understand what I'm trying to accomplish, because I will basically talk about "raw" SQL code, as it's written in most common languages.
First, this is what I want, roughly:
IF (SELECT * FROM ("january") WHERE ("day") = (19)) MATCHES [%records(#) = 1] END
ELSE
SELECT * FROM ("january") WHERE ("day") = (19) ORDER BY ("timea") DESC END
What I want to say above is: if, in the first part of the code (IF ... END), the number of resulting records matching the number 19 in the 'day' column is just one, end execution there; but if more than one record is found, jump to the next part, after the ELSE.
And if you are a Tasker user, you will understand the next (my current) setup:
A1: SQL Query [ Mode:Raw File:Tasker/Resources/Calendar Express/calendar_db Table:january Columns:day Query:SELECT * FROM ("january") WHERE ("day") = (19) Selection Parameters: Order By: Output Column Divider: Variable Array:%records Use Root:Off ]
A2: SQL Query [ Mode:Raw File:Tasker/Resources/Calendar Express/calendar_db Table:january Columns:day Query:SELECT * FROM ("january") WHERE ("day") = (19) ORDER BY ("timea") DESC Selection Parameters: Order By: Output Column Divider: Variable Array:%records Use Root:Off ] If [ %records(#) > 1 ]
An:...
So, as you can see, A1 will always run, without exception, putting the result in the variable array %records() (% is how Tasker identifies variables, like $ in other languages; and it uses parentheses rather than brackets). Then, if the number of entries in the array is just one, A2 will be skipped (its condition %records(#) > 1 is not met) and the following actions are executed.
But if, after running A1, the %records() array contains 3 entries, action A2 will be executed, overwriting the content of the %records() array set previously. This time it will contain the same number of records (3), but reordered.
Is it possible to do this in just one line of code? Thanks ;)
As 'sticky bit' replied in a comment above, I can just use the second action on its own, as the ORDER BY won't affect the output if there's only a single record. Solved!
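For reference, the single query that covers both cases (the ORDER BY simply has no visible effect when only one record matches):

SELECT * FROM ("january") WHERE ("day") = (19) ORDER BY ("timea") DESC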
I am new to Pig Latin.
I want to extract all lines that match a filter criterion (contain the word "line_token") from log files, and then from these matching lines extract two different fields meeting two separate field-match criteria. Since the lines aren't structured well, I am loading them as a chararray.
When I try to run the following code - I get an error
"Invalid resource schema: bag schema must have tuple as its field"
I have tried to perform an explicit cast to a tuple but that does not work
input_lines = LOAD '/inputdir/' AS ( line:chararray);
filtered_lines = FILTER input_lines BY (line MATCHES '.*line_token1.*' );
tokenized_lines = FOREACH filtered_lines GENERATE FLATTEN(TOKENIZE(line)) AS tok_line;
my_wordbag = FOREACH tokenized_lines {
word1 = FILTER tok_line BY ( $0 MATCHES '.*word_token1.*' ) ;
word2 = FILTER tok_line BY ( $0 MATCHES '.*word_token2.*' ) ;
GENERATE word1 , word2 as my_tuple ;
-- I also tried --> GENERATE (word1 , word2) as my_tuple ;
}
dump my_wordbag;
I suppose I am taking a very wrong approach.
Please note: my logs aren't structured well, so I can't mend the way I load them.
After loading and the initial filtering for lines of interest (which is straightforward), I guess I need to do something different rather than tokenizing the line and iterating through fields trying to find the ones I need.
Or maybe I should use joins ?
Also, if I know the structure of the line well beforehand (say, all text fields), will loading it differently (not as a chararray) make this an easier problem?
For now I made a compromise: I added an extra filter clause to my original line filter and settled for picking just one field from the line. When I get back to it I will try joins and post that code... Here's my working code that gets me useful output, but not all that I want.
-- read input lines from poorly structured log
input_lines = LOAD '/log-in-dir-in-hdfs' AS ( line:chararray) ;
-- Filter for line filter criteria and date interested in passed as arg
filtered_lines = FILTER input_lines BY (
( line MATCHES '.*line_filter1.*' )
AND ( line MATCHES '.*line_filter2.*' )
AND ( line MATCHES '.*$forDate.*' )
) ;
-- Tokenize every line
tok_lines = FOREACH filtered_lines
GENERATE TOKENIZE(line) AS tok_line;
-- Pick up specific field frm tokenized line based on column filter criteria
fnames = FOREACH tok_lines {
fname = FILTER tok_line BY ( $0 MATCHES '.*field_selection.*' ) ;
GENERATE FLATTEN(fname) as nnfname;
}
-- Count occurances of that field and store it with field name
-- My original intent is to store another field name as well
-- I will do that once I figure how to put both of them in a tuple
flgroup = FOREACH fnames
GENERATE FLATTEN(TOKENIZE((chararray)$0)) as cfname;
grpfnames = group flgroup by cfname;
readcounts = FOREACH grpfnames GENERATE COUNT(flgroup), group ;
STORE readcounts INTO '/out-dir-in-hdfs';
As I understand it, after the FLATTEN operation you have a single line (tok_line) in each row, and you want to extract 2 words from each line. REGEX_EXTRACT will help you achieve this. I'm not a regex expert, so I will leave writing the regex part up to you.
data = FOREACH tokenized_lines
GENERATE
REGEX_EXTRACT(tok_line, <first word regex goes here>, 1) as firstWord,
REGEX_EXTRACT(tok_line, <second word regex goes here>, 1) as secondWord;
I hope this helps.
You must refer to the alias, not the column.
So:
word1 = FILTER tokenized_lines BY ( $0 MATCHES '.*word_token1.*' ) ;
word1 and word2 are going to be aliases as well, not columns.
What do you need the output to look like?