How do I extract all keys from a map field?
I have a bag of tuples where one of the fields is a map that contains HTTP headers (and their values). I want to create a set of all possible keys (in my dataset) for a HTTP header and count how many times I've seen them.
Ideally, something like:
A = LOAD ...
B = FOREACH A GENERATE KEYS(http_headers)
C = GROUP FLATTEN(B) BY $0
D = FOREACH C GENERATE group, COUNT($0)
(didn't test it but it illustrates the idea..)
How do I do something like this? If I can extract a bag of keys from a map it would actually solve it. I just couldn't find any function like this in piglatin's documentation.
Yes there is a command in Pig to accomplish this.
Example:
/* data */
[a#1,b#2,c#3]
[green#sam,eggs#I,ham#am]
A = load 'data' as (M:[]);
B = foreach A generate KEYSET($0);
dump B
Output:
({(b),(c),(a)})
({(ham),(eggs),(green)})
Related
I have 50 fields, Is there any option in pig to print first 40 fields? I require something like range $0-$39.
I don’t want to specify each and every field like $0, $1,$2 etc
Giving every column when the number of columns is less is acceptable but when there are a huge number of columns what is the case?
You can use the .. notation.
First 40 fields
B = FOREACH A GENERATE $0..$39;
All fields
B = FOREACH A GENERATE $0..;
Multiple ranges,for example first 10,15-20,25-50
B = FOREACH A GENERATE $0..$9,$14..$19,$24..;
Random fields 22,33-44,46
B = FOREACH A GENERATE $21,$32..$43,$45;
I am trying to figure out the best way to parse key value pair with Pig in a dataset with mixed delimiters as below
My sample dataset is in the format below
a|b|c|k1=v1 k2=v2 k3=v3
The final output which i require here is
k1,v1,k2,v2,k3,v3
I guess one way to do this is to
A = load 'sample' PigStorage('|') as (a1,b1,c1,d1);
B = foreach A generate d1;
and here i get (k1=v1 k2=v2 k3=v3) for B
Is there any way i can further parse this by "" so as to get 3 fields k1=v1,k2=v2 and K3=v3 which can then be further split into k1,v1,k2,v2,k3,v3 using Strsplit and Flatten on "=".
Thanks for the help!
San
If you know beforehand how many key=value pair are in each record, try this:
A = load 'sample' PigStorage('|') as (a1,b1,c1,d1);
B = foreach A generate d1;
C = FOREACH B GENERATE STRSPLIT($0,'=',6); -- 6= no. of key=value pairs
D = FOREACH C GENERATE FLATTEN($0);
DUMP D
output:
(k1,v1, k2,v2, k3,v3)
If you dont know the # of key=value pair, use ' ' as delimiter and remove the unwanted prefix from $0 column.
A = LOAD 'sample' USING PigStorage(' ') as (a:chararray,b:chararray,c:chararray);
B = FOREACH A GENERATE STRSPLIT(SUBSTRING(a, LAST_INDEX_OF(a,'|')+1, (int)SIZE(a)),'=',2),STRSPLIT(b,'=',2),STRSPLIT(c,'=',2);
C = FOREACH B GENERATE FLATTEN($0), FLATTEN($1), FLATTEN($2);
DUMP C;
output:
(k1,v1, k2,v2, k3,v3)
I would like to write a pig script for below query.
Input is:
ABC,DEF,,
,,GHI,JKL
MNO,PQR,,
,,STU,VWX
Output should be:
ABC,DEF,GHI,JKL
MNO,PQR,STU,VWX
Could anyone please help me?
It will be difficult to solve this problem using native pig. One option could be download the datafu-1.2.0.jar library and try the below approach.
input.txt
ABC,DEF,,
,,GHI,JKL
MNO,PQR,,
,,STU,VWX
PigScript:
REGISTER /tmp/datafu-1.2.0.jar;
DEFINE BagSplit datafu.pig.bags.BagSplit();
A = LOAD 'input.txt' USING PigStorage(',') AS(f1,f2,f3,f4);
B = GROUP A ALL;
C = FOREACH B GENERATE FLATTEN(BagSplit(2,$1)) AS mybag;
D = FOREACH C GENERATE FLATTEN(STRSPLIT(REPLACE(BagToString(mybag),'_null_null_null_null',''),'_',4));
E = FOREACH D GENERATE $2,$3,$0,$1;
DUMP E;
Output:
(MNO,PQR,STU,VWX)
(ABC,DEF,GHI,JKL)
Note:
Based on the above input format, my assumption will be 1st row last two cols will be null, 2nd row first two cols will be null, similarly for 3rd and 4th row also
I'm trying to figure out how many words their are per line in a file in pig. I've gotten as far as loading and splitting:
raw = load file;
words = FOREACH raw GENERATE TOKENIZE(*);
which gets me a bag of tulples each containing a word. Then I go to count these items I get an error:
counts = FOREACH words GENERATE COUNT(*);
I get an error:
org.apache.pig.backend.executionengine.ExecException: ERROR 2106: Error while computing count in COUNT
...
Caused by: java.lang.NullPointerException
is that because some of the lines have an empty bag? or is there something else I'm doing wrong?
if it is the problem with an empty bag then you can try something like this: (Not tested)
raw = load file;
words = FOREACH raw GENERATE TOKENIZE(*) as tokenized_words;
counts = FOREACH words GENERATE ((tokenized_words IS null or TRIM(tokenized_words) == '') ? 0 : COUNT(*)) as total_count;
here we are writing if-else condition to check if the tokenized_words is null or empty, if yes then we are assigning zero to it else the total count.
Can you try like this?
input
Hi hello how are you
this is apache pig
works
like a charm
Pigscript:
A = LOAD 'input' AS (line:chararray);
B = FOREACH A GENERATE TOKENIZE(line);
C = FOREACH B GENERATE COUNT($0);
DUMP C;
Output:
(5)
(4)
(1)
()
(3)
I just started learning pig and I have a problem with re-naming an alias. Want I want to do is read a file, filter it, and then join it by itself. What I did is this:
register s3n://uw-cse-344-oregon.aws.amazon.com/myudfs.jar
raw = LOAD 's3n://uw-cse-344-oregon.aws.amazon.com/cse344-test-file' USING TextLoader as (line:chararray);
ntriples = foreach raw generate FLATTEN(myudfs.RDFSplit3(line)) as (subject:chararray,predicate:chararray,object:chararray);
ntriples2 = foreach raw generate FLATTEN(myudfs.RDFSplit3(line)) as (subject2:chararray,predicate2:chararray,object2:chararray);
X = FILTER ntriples BY (subject matches '.*business.*');
X2 = FILTER ntriples2 BY (subject2 matches '.*business.*');
joined= join X by subject, X2 by subject2;
joined = DISTINCT joined;
store joined into '/user/hadoop/join-results' using PigStorage();
as you can see I read and filter the file twice in order two have two different alias for each column. How can I simply copy the filtered collection and assign it new aliases? This operation was supposed to take 18 mins, but took 1.5 hour.
I found the answer:
X = FILTER ntriples BY (subject matches '.*rdfabout\\.com.*') PARALLEL 50;
y = foreach X generate subject as subject2, predicate as predicate2, object as object2 PARALLEL 50;
This is how you make copy of x and changing the aliases.