Apache Pig : Append two data sets to one

Apache Pig : Append two data sets to one - apache-pig

i have two data sets
1st set A
(111)
(222)
(555)
2nd set B
(333)
(444)
(666)
i did
C = UNION A,B;
after appending two data sets output should be first data set and next second data set
Expected output C is
(111)
(222)
(555)
(333)
(444)
(666)
But my output C is
(333)
(444)
(666)
(111)
(222)
(555)
if i apply union the result is in not order
it is difficult to me to append them in set order
How can i do this ?
i cant think of any but any help will be appreciated.

Add an extra column to each of the files giving the file_number and then do union of the modified data sets, followed by sorting based on the column giving 'file_number'
A = LOAD 'A.txt' USING PigStorage() AS (a:int);
B = LOAD 'B.txt' USING PigStorage() AS (b:int);
A_mod = FOREACH A GENERATE a, 1 AS file_number;
B_mod = FOREACH A GENERATE b, 2 AS file_number;
unified_mod = UNION A_mod, B_mod;
output = SORT unified_mod BY file_number;

I've try the classic union and for me the data stay in order.
But let's try to force-it if it doesn't :)
well as I said in the previous comment it's not efficient but it makes the job.
--In order to determine nbA you can run the following cmd in the shell : wc -l A.txt
%default nbA 3
A = LOAD 'A.txt' USING PigStorage() AS (a:int);
B = LOAD 'B.txt' USING PigStorage() AS (b:int);
A = RANK A;
B = RANK B;
--DESCRIBE B;
B = FOREACH B GENERATE rank_B + $nbA, $1;
C= UNION B,A;
C= ORDER C BY $0;
C= FOREACH C GENERATE $1; --If you want to drop the first column
DUMP C;
Output :
(111)
(222)
(555)
(333)
(444)
(666)
Where :
A.txt
111
222
555
And B.txt:
333
444
666

Related

Find continuity of elements in Pig

how can i find the continuity of a field and starting position
The input is like
A-1
B-2
B-3
B-4
C-5
C-6
The output i want is
A,1,1
B,3,2
C,2,5
Thanks.

Assuming you do not have discontinuous data with respect to a value, you can get the desired results by first grouping on value and using COUNT and MIN to get continuous_counts and start_index respectively.
A = LOAD 'data' USING PigStorage('-') AS (value:chararray;index:int);
B = FOREACH (GROUP A BY value) GENERATE
group as value,
COUNT(A) as continuous_counts,
MIN(A.value) as start_index;
STORE B INTO 'output' USING PigStorage(',');
If your data does have the possibility of discontinuous data, the solution is not longer trivial in native pig and you might need to write a UDF for that purpose.

Group and count the number of values for continous_counts. i.e.
A,1
B,3
C,2
Get the top row for each value. i.e.
A,1
B,2
C,5
Join the above two relations and get the desired output.
A = LOAD 'data.txt' USING PigStorage('-') AS (value:chararray;index:int);
B = GROUP A BY value;
C = FOREACH B GENERATE group as value,COUNT(A.value) as continuous_counts;
D = FOREACH B {
ordered = ORDER B BY index;
first = LIMIT ordered 1;
GENERATE first.value,first.index;
}
E = JOIN C BY value,D BY value;
F = FOREACH E GENERATE C::value,C::continuous_counts,D::index;
DUMP F;

Issue in Loading data from Movielens into pig

I'm trying to load some data into Pig:
Record:
11::American President, The (1995)::Comedy|Drama|Romance
12::Dracula: Dead and Loving It (1995)::Comedy|Horror
Script Used:
loadMoviesDs = LOAD '/Users/Prateek/Downloads/ml-10M100K/movies.dat'
USING PigStorage(':')
AS (Movieid:long, dummy1, Title:chararray, dummy2, Genere:chararray);
Output
11,,American President, The (1995),,Comedy|Drama|Romance
12,,Dracula,, Dead and Loving It (1995)
How to tackle the colon(:) after Dracula.-?
due to the colon, the second column is getting split into 2 columns and since we have in total of 3 columns, the last column of movieid 12 comedy|horror doesn't get loaded.

You can achieve this using REGEX_EXTRACT_ALL.
Following is the piece of code, which achieves this:
A = LOAD '/Users/Prateek/Downloads/ml-10M100K/movies.dat'
AS (f1:chrarray);
B = FOREACH A GENERATE REGEX_EXTRACT_ALL(f1, '(.*)::(.*)::(.*)');
C = FOREACH B GENERATE FLATTEN($0);
D = FOREACH C GENERATE $0 AS (MovieID:long), $1 AS (Title:chararray), $2 AS (Genre:chararray);
DUMP D;
I got the following output (which is a tuple). ":" after "Dracula" is intact.
(11,American President, The (1995),Comedy|Drama|Romance)
(12,Dracula: Dead and Loving It (1995),Comedy|Horror)

Pig and Parsing issue

I am trying to figure out the best way to parse key value pair with Pig in a dataset with mixed delimiters as below
My sample dataset is in the format below
a|b|c|k1=v1 k2=v2 k3=v3
The final output which i require here is
k1,v1,k2,v2,k3,v3
I guess one way to do this is to
A = load 'sample' PigStorage('|') as (a1,b1,c1,d1);
B = foreach A generate d1;
and here i get (k1=v1 k2=v2 k3=v3) for B
Is there any way i can further parse this by "" so as to get 3 fields k1=v1,k2=v2 and K3=v3 which can then be further split into k1,v1,k2,v2,k3,v3 using Strsplit and Flatten on "=".
Thanks for the help!
San

If you know beforehand how many key=value pair are in each record, try this:
A = load 'sample' PigStorage('|') as (a1,b1,c1,d1);
B = foreach A generate d1;
C = FOREACH B GENERATE STRSPLIT($0,'=',6); -- 6= no. of key=value pairs
D = FOREACH C GENERATE FLATTEN($0);
DUMP D
output:
(k1,v1, k2,v2, k3,v3)
If you dont know the # of key=value pair, use ' ' as delimiter and remove the unwanted prefix from $0 column.
A = LOAD 'sample' USING PigStorage(' ') as (a:chararray,b:chararray,c:chararray);
B = FOREACH A GENERATE STRSPLIT(SUBSTRING(a, LAST_INDEX_OF(a,'|')+1, (int)SIZE(a)),'=',2),STRSPLIT(b,'=',2),STRSPLIT(c,'=',2);
C = FOREACH B GENERATE FLATTEN($0), FLATTEN($1), FLATTEN($2);
DUMP C;
output:
(k1,v1, k2,v2, k3,v3)

Split string and use last value?

I would like to split a string fields into parts (space separator) and use the last value of a field. I know i can split data using strsplit, but how i can take the last value?
eg: input:
AAA BB CC
SS DD
AA
output:
CC
DD
AA
thanks

You can do that with a combination of LAST INDEX_OF, SUBSTRING and SIZE.

input
AAA BB CC
SS DD
AA
A = load 'input.txt' as (line : chararray);
B = FOREACH A generate line, LAST_INDEX_OF(line,' ') AS ind;
C = FOREACH B GENERATE (ind>0?SUBSTRING(line,ind+1,ind+3):SUBSTRING(line,0,2));
Dump C;
output
(CC)
(DD)
(AA)
if last value size is not same in this case use size() instead of ind+3

One more solution. It works well for all combination.
A = LOAD 'input.txt' AS line;
B = FOREACH A GENERATE REGEX_EXTRACT(line,'\\s*(\\w+)$',1);
DUMP B;
Output:
(CC)
(DD)
(AA)

Find all paths of at most length 2 from a set of relationships

I have a connection data set with each row marks A connects B in the form A B. The direct connection between A and B appears only once, either in the form A B or B A. I want to find all the connections at most one hop away, i.e. A and C are at most one hop away, if A and C are directly connected, or A connects C through some B.
For example, I have the following direct connection data
1 2
2 4
3 7
4 5
Then the resulting data I want is
1 {2,4}
2 {1,4,5}
3 {7}
4 {1,2,5}
5 {2,4}
7 {3}
Could anybody help me to find a way as efficient as possible? Thank you.

You could do this:
myudf.py
#outputSchema('bagofnums: {(num:int)}')
def merge_distinct(b1, b2):
out = []
for ignore, n in b1:
out.append(n)
for ignore, n in b2:
out.append(n)
return out
script.pig
register 'myudf.py' using jython as myudf ;
A = LOAD 'foo.in' USING PigStorage(' ') AS (num: int, link: int) ;
-- Essentially flips A
B = FOREACH A GENERATE link AS num, num AS link ;
-- We need to union the flipped A with A so that we will know:
-- 3 links to 7
-- 7 links to 3
-- Instead of just:
-- 3 links to 7
C = UNION A, B ;
-- C is in the form (num, link)
-- You can't do JOIN C BY link, C BY num ;
-- So, T just is C replicated
T = FOREACH D GENERATE * ;
D = JOIN C BY link, T BY num ;
E = FOREACH (FILTER E BY $0 != $3) GENERATE $0 AS num, $3 AS link_hopped ;
-- The output from E are (num, link) pairs where the link is one hop away. EG
-- 1 links to 2
-- 2 links to 4
-- 3 links to 7
-- The output will be:
-- 1 links to 4
F = COGROUP C BY num, E BY num ;
-- I use a UDF here to merge the bags together. Otherwise you will end
-- up with a bag for C (direct links) and E (links one hop away).
G = FOREACH F GENERATE group AS num, myudf.merge_distinct(C, E) ;
Schema and output for G using your sample input:
G: {num: int,bagofnums: {(num: int)}}
(1,{(2),(4)})
(2,{(4),(1),(5)})
(3,{(7)})
(4,{(5),(2),(1)})
(5,{(4),(2)})
(7,{(3)})

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

Apache Pig : Append two data sets to one - apache-pig

Related

Find continuity of elements in Pig

Issue in Loading data from Movielens into pig

Pig and Parsing issue

Split string and use last value?

Find all paths of at most length 2 from a set of relationships

Categories

Resources