apache pig group by output -- remove "(" and "{"

I do the following:
a = load '/hive/warehouse/' USING PigStorage('^') as (a1,b1,c1);
b = group a by (a1) ;
c = foreach b generate group, a.$2;
dump c;
Output shows all the groups:
abc {(1),(44),(66)}
cde {(1),(44),(66)}
How can I remove the "{" and "(" characters so that the final HDFS file can be read as a comma-delimited file?

You can't do this directly in Pig. The special syntax is required because you are storing a bag, and in order for Pig to be able to read this bag later, it needs to be stored with braces (for the bag) and parentheses (for the tuples contained in the bag).
You have a couple of options. You can read the file back into Pig, but instead of reading it as a bag, read it as a chararray. Then you can perform regex substitution to get rid of the punctuation (untested):
a = LOAD 'output' AS (grp:chararray, list:chararray);
b = FOREACH a GENERATE grp, REPLACE(list, '[{()}]', '');
(Note that group is a reserved word in Pig, so the field is named grp here.)
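To then land a plain comma-delimited file on HDFS, you could follow that with a STORE that uses a comma as the field delimiter (untested sketch; the output path is made up):
STORE b INTO 'output_clean' USING PigStorage(',');
Since the REPLACE already turns the bag text into 1,44,66, the stored lines should come out fully comma-delimited, e.g. abc,1,44,66.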
Another option is to write a UDF which will turn a bag into a tuple. Note that this is not a well-defined operation: bags have no particular order, so from one run to the next, your tuple is not guaranteed to be in the same order. But for your purposes it sounds like that may not matter. The UDF could look like (very rough draft, untested):
import java.io.IOException;
import java.util.Iterator;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class BAG_TO_TUPLE extends EvalFunc<Tuple> {
    public Tuple exec(Tuple input) throws IOException {
        DataBag bag = (DataBag) input.get(0);  // assumes the single argument is the bag
        Tuple out = TupleFactory.getInstance().newTuple();
        Iterator<Tuple> iterator = bag.iterator();
        while (iterator.hasNext()) {
            // copy the first field of each bag tuple into the output tuple
            out.append(iterator.next().get(0));
        }
        return out;
    }
}
The above UDF is terrible -- it assumes that you have exactly one element in every tuple of the bag (that you care about) and does no checking whatsoever that the input is valid, etc. But it should get you towards what you want.
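If you go that route, invocation from the original script would look roughly like this (untested; the jar name is made up, and the class is assumed to be in the default package):
REGISTER bag_to_tuple.jar;
c = FOREACH b GENERATE group, BAG_TO_TUPLE(a.$2);
DUMP c;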
The best solution, though, is to find a way to handle the extra punctuation outside of Pig if Pig is not part of your downstream processing.

This functionality is now provided in Pig as a built-in func (I'm using 0.11).
http://pig.apache.org/docs/r0.11.0/api/org/apache/pig/builtin/BagToString.html
c = foreach b generate group, a.$2 as stuff;
d = foreach c generate group, BagToString(stuff, ',');
I don't need a comma-delimited file for my use case, but I assume you can use a store func to get the final comma (between group and the now-comma-delimited-list of bag things).
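For example, something along these lines should put the final comma in place (untested; the output path is made up):
STORE d INTO 'comma_delimited_out' USING PigStorage(',');
With the sample data from the question, each output line should then read like abc,1,44,66.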

Try the FLATTEN operator:
c = foreach b generate group, FLATTEN(a.$2);
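Note that FLATTEN changes the shape of the output: every element of the bag becomes its own row. With the sample data from the question, dumping c should now show something like:
(abc,1)
(abc,44)
(abc,66)
(cde,1)
(cde,44)
(cde,66)
Each row can then be stored comma-delimited, but the values for a group are spread across several lines rather than kept on one.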


What is the Dump extension used for, and why is it so popular?

To me, adding "Dump" to the end of an expression doesn't seem to do anything different, at least for seeing rows in a table. Can you point me to an example of where it is handy?
If you are just working with an expression, there is no reason to call Dump; it's called automatically. But in the language selection box, LINQPad also allows the selection of Statements and Program. Once you select one of those, you don't get any Dump output unless you call it yourself.
With Statements or Program, you might want to call Dump multiple times. In those cases, it is handy to pass the description parameter so you can distinguish the outputs.
There are also other parameters you can use to shape the output, such as depth, which limits how much of the substructure is shown.
Simple example (Language=C# Statements):
var integers = Enumerable.Range(1,10);
integers.Select(i => new { i, v = i * i}).Dump("Squares");
integers.Select(i => new { i, v = i * i * i}).Dump("Cubes");
var output = "λ is awesome";
Encoding.UTF8.GetBytes(output)
.Dump("UTF-8");
Encoding.GetEncoding("Windows-1252").GetBytes(output)
.Dump("Windows-1252 (lossy)");

apache-pig flatten usage

When should I use FLATTEN in Pig? I am not able to understand it from the documentation. The error messages shown and the actual issue are often entirely different in Pig: sometimes it says flatten could not be imported, but the same FLATTEN works elsewhere.
Whenever you use the GROUP command on an identifier in your data file, it will list all the tuples pertaining to that identifier in a bag, which can be quite cumbersome to read.
If you use FLATTEN on top of the GROUP clause, it will instead list all the tuples separately in your output file. The drawback of using FLATTEN is possible duplication of the same record, so to remove duplicates you need to write an extra piece of code (see the sketch after the examples below).
Example of Non-flattened code:
X = GROUP A BY f1;
DUMP X;
(1,{(1,2,3)})
(4,{(4,2,1),(4,3,3)})
(8,{(8,3,4)})
Example of flattened code:
X = GROUP A BY f1;
Y = FOREACH X GENERATE FLATTEN(A);
DUMP Y;
(1,2,3)
(4,2,1)
(4,3,3)
(8,3,4)
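If the duplication mentioned above turns out to be a problem, a DISTINCT after the flattening FOREACH removes identical rows (a sketch reusing the aliases from the example):
X = GROUP A BY f1;
Y = FOREACH X GENERATE FLATTEN(A);
Z = DISTINCT Y;
DUMP Z;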

How to use Regular Expressions inside treePatterns?

I am working with the example about Parse Tree Matching and XPath shown here. More specifically, I was trying to understand how the following code works:
// assume we are parsing Java
ParserRuleContext tree = parser.compilationUnit();
String xpath = "//blockStatement/*"; // get children of blockStatement
String treePattern = "int <Identifier> = <expression>;";
ParseTreePattern p =
parser.compileParseTreePattern(treePattern,
ExprParser.RULE_localVariableDeclarationStatement);
List<ParseTreeMatch> matches = p.findAll(tree, xpath);
System.out.println(matches);
What I wanted to ask is whether we can have regular expressions inside the treePattern string.
For example, I want to write a pattern which identifies all the localVariableDeclarations inside a for loop.
I would like to be able to identify the following code:
for (Object o : list) {
int tempVariable=0;
if ( o.id ==12) {
System.out.println(t);
}
}
The way I have written the pattern (which works) to identify this code is as follows:
String pattern3 = " for ( <className1:type> <localName1:Identifier> : <listName1:expression> ) { <localVariables1:localVariableDeclarationStatement> "
+ "if (<parameter1:expression>.<identifier1:Identifier> == <value1:primary> ) <block1:statement> }";
However, if I have more than one local variable, the pattern doesn't match. I tried to add a '*' at the end, as I would in the grammar file,
<localVariables1:localVariableDeclarationStatement>*
but I get an invalid tag error.
Of course I can also add a pattern with two localVariableDeclarationStatement tags and identify the pattern with
<localVariables1:localVariableDeclarationStatement> <localVariables2:localVariableDeclarationStatement>
but this again means that I have to create a different pattern for each number of local variables that I want to identify.
At this time, we don't support repeated elements within the patterns. I thought about that but it essentially means making yet another parser generator whereas static patterns like that are fairly easy to match. It's possible to build one of these, as the last version of ANTLR had tree grammars where you could in fact specify the grammatical structure of subtrees. Until we decide what sort of enhancement to the patterns we can make, I suggest you get creative.
In your specific case, find all of the localVariableDeclarations within for loops as you are doing now and then use a small bit of code to walk that list to identify the contiguous sequences (they are all siblings) and the ones terminated by that particular IF pattern. Would that work?

Pig Nesting STRSPLIT

I have a string in field 'product' in the following form:
";TT_RAV;44;22;"
and am wanting to first split on the ';' and then split on the '_' so that what is returned is
"RAV"
I know that I can do something like this:
parse_1 = foreach {
splitup = STRSPLIT(product,';',3);
generate splitup.$1 as depiction;
};
This will return the string 'TT_RAV', and then I can do another split and project out the 'RAV'; however, this seems like it will pass the data through multiple map jobs. Is it possible to parse out the desired field in one pass?
This example does NOT work, as the inner STRSPLIT returns a tuple, but it shows the logic:
parse_1 = foreach {
splitup = STRSPLIT(STRSPLIT(product,';',3),'_',1);
generate splitup.$1 as depiction;
};
Is it possible to do this in pure piglatin without multiple map phases?
Don't use STRSPLIT. You are looking for REGEX_EXTRACT:
REGEX_EXTRACT(product, '_([^;]*);', 1) AS depiction
If it's important to be able to precisely pick out the second semicolon-delimited field and then the second underscore-delimited subfield, you can make your regex more complicated:
REGEX_EXTRACT(product, '^[^;]*;[^_;]*_([^_;]*)', 1) AS depiction
Here's a breakdown of how that regex works:
^ // Start at the beginning
[^;]* // Match as many non-semicolons as possible, if any (first field)
; // Match the semicolon; now we'll start the second field
[^_;]* // Match any characters in the first subfield
_ // Match the underscore; now we'll start the second subfield (what we want)
( // Start capturing!
[^_;]* // Match any characters in the second subfield
) // End capturing
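Putting the stricter expression into a complete statement would look roughly like this (untested; products is a made-up alias for the relation that holds the product field):
parse_1 = FOREACH products GENERATE REGEX_EXTRACT(product, '^[^;]*;[^_;]*_([^_;]*)', 1) AS depiction;
This is a single projection, so it should stay a map-only pass over the data.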
The only time there will be multiple maps is if you have an operator that triggers a reduce (JOIN, GROUP, etc...). If you run an explain on the script you can see if there is more than one reduce phase.
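For example:
explain parse_1;
prints the logical, physical, and MapReduce plans for the script, so you can count the reduce phases it will need.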

How to read data from data bag within a PIG script

I have a DataBag which is in the following format:
{([ChannelName#{ (bigXML,[])} ])}
The DataBag consists of only one item, which is a Tuple.
The Tuple consists of only one item, which is a Map.
The Map is a map between channel names and values.
Here the value is of type DataBag, which consists of only one tuple.
That tuple consists of two items: one is a chararray (a very big string) and the other is a map.
I have a UDF that emits the above bag.
Now I need to invoke another UDF by passing it the only tuple within the DataBag for a given channel from the Map.
Assuming there were no DataBag and just a tuple, such as
([ChannelName#{ (bigXML,[])} ])
I can access the data using $0.$0#'StdOutChannel'
Now with the tuple inside a bag
{([ChannelName#{ (bigXML,[])} ])}
If I do $0.$0.$0#'StdOutChannel' (prepending $0), I get the following error:
ERROR 1052: Cannot cast bag with schema bag({bytearray}) to map
How can I access data within a data bag?
Try to break this problem down a little.
Let's say you get your inner bag:
MYBAG = $0.$0#'StdOutChannel';
First, can you ILLUSTRATE or DUMP this?
What can you do with this bag? Usually FOREACH over the tuples inside.
A = FOREACH MYBAG {
    GENERATE $0 AS MyCharArray, $1 AS MyMap;
};
ILLUSTRATE A; -- or if this doesn't work
DUMP A;
Can you try this interactively and maybe edit your question a little more with some details as a result of you trying these things.
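One more hedged variant to try, assuming the UDF output sits in the first field of some relation (R below is a made-up alias): FLATTEN the outer single-tuple bag first so that the map becomes an ordinary field, and only then do the map lookup.
R2 = FOREACH R GENERATE FLATTEN($0);
R3 = FOREACH R2 GENERATE $0#'StdOutChannel';
Depending on how much schema your UDF declares, you may still need an explicit cast to a map before the # lookup works.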
Some editing hints for StackOverflow:
put backticks around your code (`ILLUSTRATE`)
indent code blocks by 4 spaces on each line