Is there an option in Apache Pig to print the first 40 of 50 fields, something like a range $0-$39?

I have 50 fields. Is there any option in Pig to print the first 40 fields? I need something like a range $0-$39.
I don’t want to specify each and every field like $0, $1, $2, etc.
Listing every column is acceptable when there are only a few, but what can be done when there are a huge number of columns?

You can use the .. notation.
First 40 fields
B = FOREACH A GENERATE $0..$39;
All fields
B = FOREACH A GENERATE $0..;
Multiple ranges, for example fields 1-10, 15-20, and 25-50
B = FOREACH A GENERATE $0..$9,$14..$19,$24..;
Arbitrary fields, for example 22, 33-44, and 46
B = FOREACH A GENERATE $21,$32..$43,$45;
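When the relation has a schema, the same range notation can also be written with field names instead of positions. The sketch below assumes a hypothetical tab-delimited file 'data.tsv' with five named columns; the file name and column names are made up for illustration.

```pig
-- Hypothetical input: a tab-delimited file with five named columns.
A = LOAD 'data.tsv' USING PigStorage('\t')
    AS (id:int, name:chararray, city:chararray, state:chararray, zip:chararray);

-- Range by field name: every column from name through state.
B = FOREACH A GENERATE name..state;

-- Open-ended ranges: everything up to city, and everything from city onward.
C = FOREACH A GENERATE ..city;
D = FOREACH A GENERATE city..;

DUMP B;
```

Named ranges tend to survive schema changes better than positional ones, since inserting a column before the range does not shift what is selected.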

Related

How to extract keys from map?

How do I extract all keys from a map field?
I have a bag of tuples where one of the fields is a map that contains HTTP headers (and their values). I want to create a set of all possible keys (in my dataset) for a HTTP header and count how many times I've seen them.
Ideally, something like:
A = LOAD ...
B = FOREACH A GENERATE KEYS(http_headers)
C = GROUP FLATTEN(B) BY $0
D = FOREACH C GENERATE group, COUNT($0)
(didn't test it but it illustrates the idea..)
How do I do something like this? If I could extract a bag of keys from a map, it would solve the problem. I just couldn't find any function like this in Pig Latin's documentation.
Yes, Pig has a built-in function for this: KEYSET.
Example:
/* data */
[a#1,b#2,c#3]
[green#sam,eggs#I,ham#am]
A = load 'data' as (M:[]);
B = foreach A generate KEYSET($0);
dump B;
Output:
({(b),(c),(a)})
({(ham),(eggs),(green)})
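To get from the bag of keys to the per-key counts the question asks for, one possible continuation is to FLATTEN the KEYSET result and then GROUP and COUNT. This is a sketch that reuses the 'data' file from the example above; the aliases key and n are illustrative names, not part of the original answer.

```pig
A = LOAD 'data' AS (M:[]);

-- One output row per map key: FLATTEN un-nests the bag returned by KEYSET.
B = FOREACH A GENERATE FLATTEN(KEYSET(M)) AS key;

-- Collect identical keys together and count how often each one was seen.
C = GROUP B BY key;
D = FOREACH C GENERATE group AS key, COUNT(B) AS n;

DUMP D;
```

With the two sample maps above, every key occurs in exactly one row, so each count would be 1; on real HTTP header data the counts show how common each header is.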

How to change a particular column value for a certain number of rows in Pig Latin

I have a Pig file with, say, 10000 rows. Is there a quick way to change the value of a certain column for, say, the first 1000 rows?
Since some info is missing, I will make a few assumptions, and then offer a solution.
By "first 1000 rows" you mean that the records can be ordered using some column.
You wish to change the value of column $1 in the first 1000 records when ordering by column $2.
The following code snippet will do what you asked for:
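a = load ...
-- rank prepends the rank (named rank_a, starting at 1) as field $0,
-- so the original $0, $1, $2 become $1, $2, $3 in b
b = rank a by $2;
c = foreach b generate $1, (rank_a <= 1000 ? 3*$2 : $2), $3..;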
Alternatively, use the FOREACH and LIMIT operators to achieve the effect.

Lagging/Differences in Pig

Is there a way to generate "column" lags or differences in Pig? Here's an example of what I'm trying to do, transcribed from R:
Value Value.Diff
1 0.209 NA
2 0.198 -0.011
3 0.187 -0.011
4 0.176 -0.011
5 0.168 -0.008
6 0.159 -0.009
I realize this might be tricky given the (presumably) distributed nature of Pig's data storage, but thought it might be possible given that Pig 0.11+ allows you to rank tuples.
Something like this should work:
values = rank values by some_field;
values = foreach values generate $0 as this_rank, $0 - 1 as prev_rank, value;
copy = foreach values generate *;
pairs = join values by this_rank, copy by prev_rank;
-- copy holds the row that follows the one in values, so subtract in that order
diffs = foreach pairs generate copy::this_rank as this_rank, copy::value - values::value as diff;
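A self-contained variant of the same self-join idea is sketched below. It assumes a hypothetical single-column input file 'values.txt' and that input order defines the row order (RANK without a BY clause simply numbers the rows). A LEFT OUTER join keeps the first row, whose difference comes out as null, matching the NA in the R output.

```pig
-- Hypothetical input: one numeric value per line.
vals   = LOAD 'values.txt' AS (value:double);

-- RANK without BY prepends a sequential row number as field $0.
ranked = RANK vals;

-- cur: each row under its own number; prev: each row shifted down by one.
cur    = FOREACH ranked GENERATE $0 AS rnk, value;
prev   = FOREACH ranked GENERATE $0 + 1 AS rnk, value AS prev_value;

-- Row n of cur meets row n-1 of prev; row 1 has no partner and gets nulls.
pairs  = JOIN cur BY rnk LEFT OUTER, prev BY rnk;
diffs  = FOREACH pairs GENERATE cur::rnk AS rnk,
                                cur::value AS value,
                                cur::value - prev::prev_value AS diff;

sorted = ORDER diffs BY rnk;
DUMP sorted;
```

If the order should come from a column instead, RANK ... BY can be used, but tied values then share a rank, so the join may pair a row with more than one predecessor.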

Counting result lines in Pig Latin

I'm trying to run a simple word counter in Pig Latin as follows:
lines = LOAD 'SOME_FILES' using PigStorage('#') as (line:chararray);
word = FILTER lines BY (line matches '.*SOME_VALUE.*');
I want to count how many times SOME_VALUE is found in SOME_FILES, so the expected output should be something like:
(SOME_VALUE,xxxx)
where xxxx is the total number of times SOME_VALUE was found.
How can I search for multiple values and print each one as above?
What you should do is split each line into a bag of tokens, then FLATTEN it. Then you can do a GROUP on the words to pull all occurrences of each word into its own line. Once you do a COUNT of the resulting bag, you'll have the count for each word in the document.
This will look something like:
B = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) ;
C = GROUP B BY $0 ;
D = FOREACH C GENERATE group AS word, COUNT(B) AS count ;
If you aren't sure what each step is doing, then you can use DESCRIBE and DUMP to help visualize what is happening.
Update: If you want to filter the results so they contain only the few strings you care about, you can do:
E = FILTER D BY (word == 'foo') OR
(word == 'bar') OR
(word == 'etc') ;
-- Another way...
E = FILTER D BY (word matches 'foo|bar|etc') ;
However, you can also do this between B and C so you don't do any COUNTs you don't need to.
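Filtering before the GROUP might look like the sketch below. The relation name wanted is a made-up alias, and 'foo|bar|etc' stands in for whatever words are of interest.

```pig
lines  = LOAD 'SOME_FILES' USING PigStorage('#') AS (line:chararray);

-- One row per token, with the token named so it can be filtered on.
B      = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;

-- Drop uninteresting words before grouping so they are never counted.
wanted = FILTER B BY word MATCHES 'foo|bar|etc';

C      = GROUP wanted BY word;
D      = FOREACH C GENERATE group AS word, COUNT(wanted) AS cnt;

DUMP D;
```

Since MATCHES has to match the whole string, 'foo|bar|etc' keeps only tokens that are exactly one of those words, not tokens that merely contain them.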

PIG - Defining the delimiter used for a bag after a GROUP function

In Pig, I'm loading and grouping two files. I end up with something like this:
A = LOAD 'File1' Using PigStorage('\t');
B = LOAD 'File2' Using PigStorage('\t');
C = COGROUP A BY $0, B BY $0;
STORE C INTO 'Output' USING PigStorage('\t');
Output:
123 {(123,XYZ,456)} {(123,QRS,889,QWER)}
Where the first field is the group key, the first bag is from File1, and the next bag is from File2. These three sections are delimited from each other using whatever I identified in the PigStorage('\t') clause.
Question: How do I force Pig to delimit the bags by something other than a comma? In my real data, there are commas present and so I need to delimit by tabs instead.
Desired output:
123 {(123\tXYZ\t456)} {(123\tQRS\t889\tQWER)}
This seems to be an open issue (as of June 2013) in Pig. See the corresponding JIRA for more details. Until the issue is fixed, you can change your input data.