Pig Latin - Extracting fields meeting two different filter criteria from chararray line and grouping in a bag - apache-pig

I am new to Pig Latin.
I want to extract all lines that match a filter criteria (have a word "line_token" ) from log files and then from these matching lines extract two different fields meeting two separate field match criteria . Since the lines aren't structured well I am loading them as a char array.
When I try to run the following code - I get an error
"Invalid resource schema: bag schema must have tuple as its field"
I have tried to perform an explicit cast to a tuple but that does not work
input_lines = LOAD '/inputdir/' AS ( line:chararray);
filtered_lines = FILTER input_lines BY (line MATCHES '.*line_token1.*' );
tokenized_lines = FOREACH filtered_lines GENERATE FLATTEN(TOKENIZE(line)) AS tok_line;
my_wordbag = FOREACH tokenized_lines {
word1 = FILTER tok_line BY ( $0 MATCHES '.*word_token1.*' ) ;
word2 = FILTER tok_line BY ( $0 MATCHES '.*word_token1.*' ) ;
GENERATE word1 , word2 as my_tuple ;
-- I also tried --> GENERATE (word1 , word2) as my_tuple ;
}
dump my_wordbag;
I suppose I am taking a very wrong approach.
Please note - my logs aren't structured well - so I cant mend the way I load
Post loading and initial filtering for lines of interest ( which is straightforward) - I guess I need to do something different rather than tokenize line and iterate through fields trying to find fields.
Or maybe I should use joins ?
Also if I know the structure of line beforehand well as all text fields, then will loading it differently ( not as a chararray) make it an easier problem ?
For now I made a compromise - I added a extra filter clause in my original - line filter and settled for picking just one field from line. When I get back to it I will try with joins and post that code ... - here's my working code that gets me a useful output - but not all that I want.
-- read input lines from poorly structured log
input_lines = LOAD '/log-in-dir-in-hdfs' AS ( line:chararray) ;
-- Filter for line filter criteria and date interested in passed as arg
filtered_lines = FILTER input_lines BY (
( line MATCHES '.*line_filter1*' )
AND ( line MATCHES '.*line_filter2.*' )
AND ( line MATCHES '.*$forDate.*' )
) ;
-- Tokenize every line
tok_lines = FOREACH filtered_lines
GENERATE TOKENIZE(line) AS tok_line;
-- Pick up specific field frm tokenized line based on column filter criteria
fnames = FOREACH tok_lines {
fname = FILTER tok_line BY ( $0 MATCHES '.*field_selection.*' ) ;
GENERATE FLATTEN(fname) as nnfname;
}
-- Count occurances of that field and store it with field name
-- My original intent is to store another field name as well
-- I will do that once I figure how to put both of them in a tuple
flgroup = FOREACH fnames
GENERATE FLATTEN(TOKENIZE((chararray)$0)) as cfname;
grpfnames = group flgroup by cfname;
readcounts = FOREACH grpfnames GENERATE COUNT(flgroup), group ;
STORE readcounts INTO '/out-dir-in-hdfs';

As I understand, after the FLATTEN operation, you have single line (tok_line) in each row and you want to extract 2 words from each line. REGEX_EXTRACT will help you achieve this. I'm not a REGEX expert so will leave writing the REGEX part up to you.
data = FOREACH tokenized_lines
GENERATE
REGEX_EXTRACT(tok_line, <first word regex goes here>) as firstWord,
REGEX_EXTRACT(tok_line, <second word regex goes here>) as secondWord;
I hope this helps.

You must refer to the alias, not the column.
So:
word1 = FILTER tokenized_lines BY ( $0 MATCHES '.*word_token1.*' ) ;
word1 and word2 are going to be aliases as well, not columns.
How do you need the output to look like?

Related

Apache pig Store based on condition

I'm reading from a csv file and after grouping those datas I'm doing a count operation . Is there any way to store the datas into a folder name bad if the count is 0 and to good if the count is > 0 . I tried with the below code but it is not happening .
CODE :
STORE countVal INTO '/user/cloudera/good' IF countVal > 0 ;
USE function SPLIT. Refer :
https://pig.apache.org/docs/r0.7.0/piglatin_ref2.html#SPLIT
SPLIT A INTO X IF f1<7, Y IF f2==5, Z IF (f3<6 OR f3>6);
There are a couple of ways this :
1)Use the split function to perform the split based on the criteria.
SPLIT data into good if count>0, bad if (count==0);
2)Use a FOREACH loop to separate the data based on a criteria, using a BinCond operator.
X = FOREACH A GENERATE , data, (count>0?"good":"bad");

How to make multiple rows from single row in pig for movie lens dataset

I want to divide a single row into multiple row on the basis of a field in pig.
Example:
Consider one of the row in movie Data Set as follows:
(31807, Dot the I (2003), Drama|Film-Noir|Thriller)
each field is separated by ','.
Desired Output is as follows in 3 different rows:
31807,Dot the I (2003),Drama
31807,Dot the I (2003),Film-Noir
31807,Dot the I (2003),Thriller
Can anyone please help me to get the desired output in pig.
The below logic will help you .
/* Input
(31807,Dot the I (2003),Drama|Film-Noir|Thriller)
*/
list = LOAD '/user/cloudera/movies.txt' USING PigStorage(',') AS(id:int,name:chararray,generes:chararray);
list_each = FOREACH list GENERATE id,name, flatten(TOKENIZE(generes,'|'));
dump list_each;
/* Output
(31807,Dot the I (2003),Drama)
(31807,Dot the I (2003),Film-Noir)
(31807,Dot the I (2003),Thriller)
*/

Convesion from Hive to PigLatin

I am trying to convert the below Hive statement to Pig:
max(substr(case when url like 'http:%' then '' else url end,1,50))
My pig statement for the above is:
url_group = GROUP data by (uid);
max_substr_url= FOREACH url_group generate SUBSTRING(MAX(((Coalesce(data.url) matches '.*http:%.*') ? '' : Coalesce(data.url))), 0, 49);
For some of the data, the url can be null. So I have written a pig UDF called Coalesce(String) which returns an empty string if the data is either null or empty. If the data is not null or not empty it returns the string back.
The above pig statement is giving me lot of trouble and tried n different options/ways but nothing worked. Anyone got any ideas on how to implement this? Please help me.
Thanks in advance
You are going to want to use a nested FOREACH so that you can do the substring transformation on each tuple in the data bag then take the MAX of the transformed bag.
A = GROUP data by (uid);
B = FOREACH url_group {
-- MAX needs a one column bag
transformed = FOREACH data
GENERATE SUBSTRING((Coalesce(url) matches '.*http:.*' ? '' : Coalesce(url)), 0, 49);
GENERATE group AS uid, MAX(transformed) ;
}

SQL regex and field

I want to change the query to return multiply values in extra_fields, how can I change the regex? Also I don't understand what extra_fields is - is it a field? If so why it is not called with the table prefix like i.extra_fields?
SELECT i.*,
CASE WHEN i.modified = 0 THEN i.created ELSE i.modified END AS lastChanged,
c.name AS categoryname,
c.id AS categoryid,
c.alias AS categoryalias,
c.params AS categoryparams
FROM #__k2_items AS i
LEFT JOIN #__k2_categories AS c ON c.id = i.catid
WHERE i.published = 1
AND i.access IN(1,1)
AND i.trash = 0
AND c.published = 1
AND c.access IN(1,1)
AND c.trash = 0
AND (i.publish_up = '0000-00-00 00:00:00'
OR i.publish_up <= '2013-06-12 22:45:19'
)
AND (i.publish_down = '0000-00-00 00:00:00'
OR i.publish_down >= '2013-06-12 22:45:19'
)
AND extra_fields REGEXP BINARY '(.*{"id":"2","value":\["[^\"]*1[^\"]*","[^\"]*2[^\"]*","[^\"]*3[^\"]*"\]}.*)'
ORDER BY i.id DESC
The extra_fields is a column of the #__k2_items table. The table qualifier can be omitted, because it is not ambiguous in this query. The column is JSON encoded. That is a serialization format used to store information which is not searchable by design. Applying a RegExp may work one day, but fail another day, since there is no guarantee for id preceeding value (as in your example).
The right way
The right way to filter this is to ignore the extra_fields condition in the SQL query an evaluate in the resultset instead. Example:
$rows = $db->loadObjectList('id');
foreach ($rows as $id => $row) {
$extra_fields = json_decode($row->extra_fields);
if ($extra_fields->id != 2) {
unset($rows[$id]);
}
}
The short way
If you can't change the database layout (which is true for extensions you want to keep updateable), you must split the condition into two, because there is no guarantee for a certain order of the subfields. For some reason, one day value may occur before id. So change your query to
...
AND extra_fields LIKE '%"id":"2"%'
AND extra_fields REGEXP BINARY '"value":\[("[^\"]*[123][^\"]*",?)+\]'
Prepare an intermediate table to hold the contents of extra_fields. Each extra_fields field will be converted into a series of records. Then do a join.
Create a trigger and cronjob to keep the temp table in sync.
Another way is to write UDF in Perl that will decode the field, but AFAIK it is not indexable in mysql.
Using an external search engine is out of scope.
Ok, i didnt want to change the db strucure, i gost some help and changed the regex intoAND extra_fields REGEXP BINARY '(.*{"id":"2","value":\[("[^\"]*[123][^\"]*",?)+\]}.*)'
and i got the right resaults
Thanks

Regex capturing inside a group

I working on a method to get all values based on a SQL query and then scape them in php.
The idea is to get the programmer who is careless about security when is doing a SQL query.
So when I try to execute this:
INSERT INTO tabla (a, b,c,d) VALUES ('a','b','c',a,b)
The regex needs to capture 'a' 'b' 'c' a and b
I was working on this a couple of days.
This was as far I can get with 2 regex querys, but I want to know if there is a better way to do:
VALUES ?\((([\w'"]+).+?)\)
Based on the previous SQL this will match:
VALUES ('a','b','c',a,b)
The second regex
['"]?(\w)['"]?
Will match
a b c a b
Previously removing VALUES, of course.
This way will match a lot of the values I gonna insert.
But doesn't work with JSON for example.
{a:b, "asd":"ads" ....}
Any help with this?
First, I think you should know that SQL support many types of single/double quoted string:
'Northwind\'s category name'
'Northwind''s category name'
"Northwind \"category\" name"
"Northwind ""category"" name"
"Northwind category's name"
'Northwind "category" name'
'Northwind \\ category name'
'Northwind \ncategory \nname'
to match them, try with these patterns:
"[^\\"]*(?:(?:\\.|"")[^\\"]*)*"
'[^\\']*(?:(?:\\.|'')[^\\']*)*'
combine patterns together:
VALUES\s*\(\s*(?:"[^\\"]*(?:(?:\\.|"")[^\\"]*)*"|'[^\\']*(?:(?:\\.|'')[^\\']*)*'|\w+)(?:\s*,\s*(?:"[^\\"]*(?:(?:\\.|"")[^\\"]*)*"|'[^\\']*(?:(?:\\.|'')[^\\']*)*'|\w+))*\)
PHP5.4.5 sample code:
<?php
$pat = '/\bVALUES\s*\((\s*(?:"[^\\"]*(?:(?:\\.|"")[^\\"]*)*"|\'[^\\\']*(?:(?:\\.|\'\')[^\\\']*)*\'|\w+)(?:\s*,\s*(?:"[^\\"]*(?:(?:\\.|"")[^\\"]*)*"|\'[^\\\']*(?:(?:\\.|\'\')[^\\\']*)*\'|\w+))*)\)/';
$sql_sample1 = "INSERT INTO tabla (a, b,c,d) VALUES ('a','b','c',a,b)";
if( preg_match($pat, $sql_sample1, $matches) > 0){
printf("%s\n", $matches[0]);
printf("%s\n\n", $matches[1]);
}
$sql_sample2 = 'INSERT INTO tabla (a, b,c,d) VALUES (\'a\',\'{a:b, "asd":"ads"}\',\'c\',a,b)';
if( preg_match($pat, $sql_sample2, $matches) > 0){
printf("%s\n", $matches[0]);
printf("%s\n", $matches[1]);
}
?>
output:
VALUES ('a','b','c',a,b)
'a','b','c',a,b
VALUES ('a','{a:b, "asd":"ads"}','c',a,b)
'a','{a:b, "asd":"ads"}','c',a,b
If you need to get each value from result, split by , (like parsing CSV)
I hope this will help you :)