How do I specify a field name for a wrapped tuple in Pig?

I have tuples with schema (a:int, b:int, c:int) stored in alias first. I want to convert each tuple to produce a new relation, second, with a schema like this:
(d: (a:int, b:int, c:int))
Basically, I've wrapped my initial tuple in another tuple and named the field. This is in preparation for a cross operation where I want to cross two relations but keep each one in a named field.
Here is what I expected to work, but it produces an error:
second = FOREACH first GENERATE TOTUPLE(*) AS (d:tuple);
This errors out too:
second = FOREACH first GENERATE TOTUPLE(*) AS (d:tuple (a:int, b:int, c:int));
Thanks!
Uri

What about:
second = FOREACH first GENERATE TOTUPLE(*) AS d;
describe second;
second: {d: (a: int,b: int,c: int)}
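To connect this back to the cross you mentioned, here is a minimal sketch of how the wrapped relations might be combined. It assumes first has the schema from the question; the relation other and its fields x and y are hypothetical and only there for illustration. The fields are listed explicitly in TOTUPLE rather than using *, which is just a stylistic choice here.

```pig
-- Assumes 'first' has schema (a:int, b:int, c:int), as in the question.
-- 'other' and its fields x, y are hypothetical, for illustration only.
second = FOREACH first GENERATE TOTUPLE(a, b, c) AS d;
other_wrapped = FOREACH other GENERATE TOTUPLE(x, y) AS e;

-- Each output row carries one named tuple field from each input relation.
crossed = CROSS second, other_wrapped;
DESCRIBE crossed;
```

After the cross, Pig should prefix the field names with the source alias (second::d, other_wrapped::e), although the short names d and e can normally be used as long as they are unambiguous.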

Related

Pig : Adding new column to existing inner Tuple in Pig

I want to add a new column to an existing tuple column in Pig.
Example input schema:
name: chararray,
attribute_list: {innertuple: (height: int,length: int,size: chararray)}
Using a GENERATE statement, I want to add a new column to the tuple that holds the same value as length but under a different name.
Output schema:
name: chararray,
attribute_list: {innertuple: (height: int,length: int,size: chararray, len : int)}
I tried the approach below, but it's not working:
op = Foreach input_data generate
name,
attribute_list as attr : {(
height,
length,
size,
length as len)};
Please suggest.
Thanks in advance
Option 1:
Add a rank to each row, flatten the attribute_list bag, then recreate the bag with the additional column.
--Rank input_schema(ip) using rank function:
ranked= rank ip;
-- flatten each value of bag.tuple to row level
a= foreach ranked generate rank_ip as id, flatten(attribute_list.$0), flatten(attribute_list.$0.length) as len;
b= group a by id;
op= foreach b generate flatten($1.name) as name, $1 as attr;
-- Note: name will also be part of the attr bag.
Option 2:
a. DataFu provides a Pig UDF, BagConcat, that concatenates multiple bags.
b. Define an alias for the BagConcat UDF:
define BagConcat datafu.pig.bags.BagConcat();
c. Flatten elements:
a= foreach ip generate name, flatten(attribute_list.$0), flatten(attribute_list.$0.length) as len;
d. Reproject your bag:
op= foreach a generate name, BagConcat(height,len,size,len) as attr;
A = LOAD 'PATH' USING PigStorage() AS (ID:INT);
B = FOREACH A GENERATE *, null as len:int;
You can also give any integer value in place of null.

Pig Latin - Extracting fields meeting two different filter criteria from chararray line and grouping in a bag

I am new to Pig Latin.
I want to extract all lines that match a filter criterion (contain the word "line_token") from log files, and then extract two different fields from these matching lines, each meeting its own match criterion. Since the lines aren't well structured, I am loading them as a chararray.
When I try to run the following code, I get the error
"Invalid resource schema: bag schema must have tuple as its field"
I have tried an explicit cast to a tuple, but that does not work.
input_lines = LOAD '/inputdir/' AS ( line:chararray);
filtered_lines = FILTER input_lines BY (line MATCHES '.*line_token1.*' );
tokenized_lines = FOREACH filtered_lines GENERATE FLATTEN(TOKENIZE(line)) AS tok_line;
my_wordbag = FOREACH tokenized_lines {
word1 = FILTER tok_line BY ( $0 MATCHES '.*word_token1.*' ) ;
word2 = FILTER tok_line BY ( $0 MATCHES '.*word_token2.*' ) ;
GENERATE word1 , word2 as my_tuple ;
-- I also tried --> GENERATE (word1 , word2) as my_tuple ;
}
dump my_wordbag;
I suppose I am taking a very wrong approach.
Please note that my logs aren't well structured, so I can't change the way I load them.
After loading and the initial filtering for lines of interest (which is straightforward), I guess I need to do something other than tokenize each line and iterate through the fields trying to find the ones I want.
Or maybe I should use joins?
Also, if I know the structure of a line beforehand, with all fields being text, will loading it differently (not as a chararray) make this an easier problem?
For now I made a compromise: I added an extra filter clause to my original line filter and settled for picking just one field from each line. When I get back to it I will try with joins and post that code. Here's my working code, which gets me useful output, but not all that I want.
-- read input lines from poorly structured log
input_lines = LOAD '/log-in-dir-in-hdfs' AS ( line:chararray) ;
-- Filter for line filter criteria and date interested in passed as arg
filtered_lines = FILTER input_lines BY (
( line MATCHES '.*line_filter1.*' )
AND ( line MATCHES '.*line_filter2.*' )
AND ( line MATCHES '.*$forDate.*' )
) ;
-- Tokenize every line
tok_lines = FOREACH filtered_lines
GENERATE TOKENIZE(line) AS tok_line;
-- Pick up a specific field from the tokenized line based on the column filter criteria
fnames = FOREACH tok_lines {
fname = FILTER tok_line BY ( $0 MATCHES '.*field_selection.*' ) ;
GENERATE FLATTEN(fname) as nnfname;
}
-- Count occurrences of that field and store it with the field name
-- My original intent is to store another field name as well
-- I will do that once I figure how to put both of them in a tuple
flgroup = FOREACH fnames
GENERATE FLATTEN(TOKENIZE((chararray)$0)) as cfname;
grpfnames = group flgroup by cfname;
readcounts = FOREACH grpfnames GENERATE COUNT(flgroup), group ;
STORE readcounts INTO '/out-dir-in-hdfs';
As I understand it, after the FLATTEN operation you have a single line (tok_line) in each row, and you want to extract two words from each line. REGEX_EXTRACT will help you achieve this. I'm not a regex expert, so I will leave writing the regex part up to you.
-- The third argument to REGEX_EXTRACT is the index of the capture group to return.
data = FOREACH tokenized_lines
GENERATE
REGEX_EXTRACT(tok_line, <first word regex goes here>, 1) as firstWord,
REGEX_EXTRACT(tok_line, <second word regex goes here>, 1) as secondWord;
I hope this helps.
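To make that a bit more concrete, here is a rough sketch that applies REGEX_EXTRACT to the whole line instead of to individual tokens, so both fields end up in the same row. It assumes the filtered_lines relation from the question and that fields are separated by single spaces; word_token1 and word_token2 stand in for your real match criteria, and the output names first_word and second_word are made up for illustration.

```pig
-- Assumes filtered_lines has a single chararray field named line (from the question).
-- word_token1 / word_token2 are placeholders for the real field criteria.
-- [^ ]* matches the rest of the space-delimited field around the token;
-- the leading .*? and trailing .* let the pattern cover the whole line.
my_fields = FOREACH filtered_lines GENERATE
    REGEX_EXTRACT(line, '.*?([^ ]*word_token1[^ ]*).*', 1) AS first_word,
    REGEX_EXTRACT(line, '.*?([^ ]*word_token2[^ ]*).*', 1) AS second_word;

-- Optionally drop lines where either field was not found (REGEX_EXTRACT yields null).
both_found = FILTER my_fields BY first_word IS NOT NULL AND second_word IS NOT NULL;
```

Because each row now holds both values as plain chararray columns, you can group or count on either one without needing to build a bag of filtered tokens first.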
You must refer to the alias, not the column.
So:
word1 = FILTER tokenized_lines BY ( $0 MATCHES '.*word_token1.*' ) ;
word1 and word2 are going to be aliases as well, not columns.
What do you need the output to look like?

Conversion from Hive to Pig Latin

I am trying to convert the below Hive statement to Pig:
max(substr(case when url like 'http:%' then '' else url end,1,50))
My Pig statement for the above is:
url_group = GROUP data by (uid);
max_substr_url= FOREACH url_group generate SUBSTRING(MAX(((Coalesce(data.url) matches '.*http:%.*') ? '' : Coalesce(data.url))), 0, 49);
For some of the data, the url can be null, so I have written a Pig UDF called Coalesce(String) that returns an empty string if the data is null or empty, and otherwise returns the string unchanged.
The above Pig statement is giving me a lot of trouble; I have tried many different options, but nothing worked. Does anyone have any ideas on how to implement this? Please help me.
Thanks in advance
You are going to want to use a nested FOREACH so that you can do the substring transformation on each tuple in the data bag and then take the MAX of the transformed bag.
A = GROUP data by (uid);
B = FOREACH A {
-- MAX needs a one-column bag
-- 'http:.*' mirrors Hive's LIKE 'http:%' (MATCHES must match the whole string)
-- SUBSTRING's stop index is exclusive, so (0, 50) mirrors Hive's substr(..., 1, 50)
transformed = FOREACH data
GENERATE SUBSTRING((Coalesce(url) matches 'http:.*' ? '' : Coalesce(url)), 0, 50);
GENERATE group AS uid, MAX(transformed) ;
}

Split tuple with multiple fields to tuples with single field in Pig

I have tuples of varying length. I am trying to convert them to tuples with only one field each (each field is a map).
Original data:
dump entryArray;
([symbol#HIG,security_type#EQUITY,foreign_entry_id#743094])
([symbol#PEW,security_type#EQUITY,foreign_entry_id#743084])
([symbol#AFFY,security_type#EQUITY,foreign_entry_id#5585],[symbol#RFG,security_type#ETF,foreign_entry_id#5586],[symbol#SCHW,security_type#EQUITY,foreign_entry_id#5587],[symbol#VWO,security_type#ETF,foreign_entry_id#5588])
I would like the output to be (each field should still be a map):
([symbol#HIG,security_type#EQUITY,foreign_entry_id#743094])
([symbol#PEW,security_type#EQUITY,foreign_entry_id#743084])
([symbol#AFFY,security_type#EQUITY,foreign_entry_id#5585])
([symbol#RFG,security_type#ETF,foreign_entry_id#5586])
([symbol#SCHW,security_type#EQUITY,foreign_entry_id#5587])
([symbol#VWO,security_type#ETF,foreign_entry_id#5588])
I tried the following; the output has the same format, but it seems that the field is no longer a map:
entry = FOREACH entryArray GENERATE FLATTEN(TOBAG());
dump entry;
([symbol#HIG,security_type#EQUITY,foreign_entry_id#743094])
([symbol#PEW,security_type#EQUITY,foreign_entry_id#743084])
([symbol#AFFY,security_type#EQUITY,foreign_entry_id#5585])
([symbol#RFG,security_type#ETF,foreign_entry_id#5586])
([symbol#SCHW,security_type#EQUITY,foreign_entry_id#5587])
([symbol#VWO,security_type#ETF,foreign_entry_id#5588])
security_type = FOREACH entry GENERATE FLATTEN($0#'security_type');
It throws:
ERROR 1052: Cannot cast bytearray to map with schema :map
org.apache.pig.impl.logicalLayer.validators.TypeCheckerException: ERROR 1059: <line 18, column 16> Problem while reconciling output schema of ForEach
at org.apache.pig.newplan.logical.visitor.TypeCheckingRelVisitor.throwTypeCheckerException(TypeCheckingRelVisitor.java:141)
at org.apache.pig.newplan.logical.visitor.TypeCheckingRelVisitor.visit(TypeCheckingRelVisitor.java:181)
at org.apache.pig.newplan.logical.relational.LOForEach.accept(LOForEach.java:75)
......
Any suggestions would be much appreciated. Thanks!

Error 1045 on SUM function in Pig Latin with an int

The following Pig Latin script:
data = load 'access_log_Jul95' using PigStorage(' ') as (ip:chararray, dash1:chararray, dash2:chararray, date:chararray, date1:chararray, getRequset:chararray, location:chararray, http:chararray, code:int, size:int);
splitDate = foreach data generate size as size:int , ip as ip, FLATTEN(STRSPLIT(date, ':')) as h;
groupedIp = group splitDate by h.$1;
a = foreach groupedIp{
added = foreach splitDate generate SUM(size);
generate added;
};
describe a;
gives me the error:
ERROR 1045:
<file 3.pig, line 10, column 39> Could not infer the matching function for org.apache.pig.builtin.SUM as multiple or none of them fit. Please use an explicit cast.
This error makes me think I need to cast size as an int, but if I describe groupedIp, I get the following schema:
groupedIp: {group: bytearray,splitDate: {(size: int,ip: chararray,h: bytearray)}}
This indicates that size is an int and should be usable by the SUM function.
Am I calling the SUM function incorrectly? Let me know if you would like to see anything else, such as the input file.
SUM operates on a bag as input, but you pass it the field 'size'.
Try to eliminate the nested foreach and use:
a = foreach groupedIp generate SUM(splitDate.size);
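If you would rather keep the nested block (for example, to filter before summing), a sketch along these lines may work. It assumes groupedIp has the schema shown in the question; the aliases valid and sizes and the output name total_size are made up for illustration. The point is that inside the block you project the column out of the bag, so SUM receives a bag rather than a scalar field.

```pig
-- Assumes groupedIp: {group, splitDate: {(size: int, ip: chararray, h)}} as in the question.
a = FOREACH groupedIp {
    -- drop rows with no usable size (values that fail to parse as int typically load as null)
    valid = FILTER splitDate BY size IS NOT NULL;
    -- project the single column; 'sizes' is a bag, which is what SUM expects
    sizes = valid.size;
    GENERATE group, SUM(sizes) AS total_size;
};
DESCRIBE a;
```

Without the filter, this is equivalent to the one-line version, so the nested form is only worth it when you need the extra step.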
Do some dumps of your data. I'll bet some of the values in the size column are non-integer, and Pig runs into that and dies. You could also write your own isInteger UDF to check this before the rest of your processing and throw out any values that aren't integers.
SUM, AVG and COUNT are functions that always work on a bag, so group the data first and then apply the function to the grouped bag, as below:
A = load 'nyse_data.txt' as (exchange:chararray, symbol:chararray, date:chararray, open:float, high:float, low:float, close:float, volume:int, adj_close:float);
G = group A by symbol;
C = foreach G generate group, SUM(A.open);