Find all paths of at most length 2 from a set of relationships

I have a connection data set where each row marks a connection "A B", meaning A connects to B. The direct connection between A and B appears only once, either as A B or as B A. I want to find all connections at most one hop away, i.e. A and C are at most one hop away if they are directly connected, or if A connects to C through some B.
For example, I have the following direct connection data
1 2
2 4
3 7
4 5
Then the resulting data I want is
1 {2,4}
2 {1,4,5}
3 {7}
4 {1,2,5}
5 {2,4}
7 {3}
Could anybody help me find a way that is as efficient as possible? Thank you.

You could do this:
myudf.py
@outputSchema('bagofnums: {(num:int)}')
def merge_distinct(b1, b2):
    # Merge both bags into one flat list, dropping duplicates
    # (duplicates can occur when a direct link is also reachable in two hops,
    # e.g. in a triangle).
    seen = set()
    out = []
    for bag in (b1, b2):
        for ignore, n in bag:
            if n not in seen:
                seen.add(n)
                out.append(n)
    return out
script.pig
register 'myudf.py' using jython as myudf ;
A = LOAD 'foo.in' USING PigStorage(' ') AS (num: int, link: int) ;
-- Essentially flips A
B = FOREACH A GENERATE link AS num, num AS link ;
-- We need to union the flipped A with A so that we will know:
-- 3 links to 7
-- 7 links to 3
-- Instead of just:
-- 3 links to 7
C = UNION A, B ;
-- C is in the form (num, link)
-- You can't do JOIN C BY link, C BY num ;
-- So, T is just C replicated
T = FOREACH C GENERATE * ;
D = JOIN C BY link, T BY num ;
E = FOREACH (FILTER D BY $0 != $3) GENERATE $0 AS num, $3 AS link_hopped ;
-- E contains (num, link) pairs where the link is one hop away. E.g. given
-- the direct links:
-- 1 links to 2
-- 2 links to 4
-- 3 links to 7
-- E will contain:
-- 1 links to 4
F = COGROUP C BY num, E BY num ;
-- I use a UDF here to merge the bags together. Otherwise you will end
-- up with a bag for C (direct links) and E (links one hop away).
G = FOREACH F GENERATE group AS num, myudf.merge_distinct(C, E) ;
Schema and output for G using your sample input:
G: {num: int,bagofnums: {(num: int)}}
(1,{(2),(4)})
(2,{(4),(1),(5)})
(3,{(7)})
(4,{(5),(2),(1)})
(5,{(4),(2)})
(7,{(3)})

Related

PostgreSQL data transformation - Turn rows into columns

I have a table whose structure looks like the following:
k | i | p | v
Notice that the key (k) is not unique; there are no keys at all. Each key can have multiple attributes (i = 0, 1, 2, ...) which can be of different types (p) and have different values (v). One attribute type may also appear multiple times (p(i-1) = p(i)).
What I want to do is pick certain attribute types and their corresponding values and place them in the same row. For example I want to have:
k | attr_name1 | attr_name2
I have managed to make a query that does this and works for all keys (k) for which attr_name1 and attr_name2 appear in the column p of the initial table:
SELECT DISTINCT ON (key) fn.k AS key, fn.v AS attr_name1, a.v AS attr_name2
FROM Table fn
LEFT JOIN Table a ON fn.k = a.k
AND a.p = 'attr_name2'
WHERE fn.p = 'attr_name1'
I would like, however, to take into account the case where a certain key has no attribute named attr_name1 and insert a NULL value into the corresponding column of the new table. I am not sure how to achieve that. I have no issue using multiple queries or intermediate tables etc, but there are quite a lot of rows in the table and I need something that scales to millions of rows.
Any help would be appreciated.
Example:
k i p v
1 0 a 10
1 1 b 12
1 2 c 34
1 3 d 44
1 4 e 09
2 0 a 11
2 1 b 13
2 2 d 22
2 3 f 34
Would turn into (assuming I am only interested in columns a, b, c):
k a b c
1 10 12 34
2 11 13 NULL
I would use conditional aggregation. That is, an aggregate function around a CASE expression.
SELECT
k,
MAX(CASE WHEN p='a' THEN v END) AS a,
MAX(CASE WHEN p='b' THEN v END) AS b,
MAX(CASE WHEN p='c' THEN v END) AS c
FROM
your_table
GROUP BY
k
This presumes that (k, p) is unique. If a (k, p) pair appears more than once, this will simply pick the v with the highest value for each (k, p).
As a general rule, this kind of pivoting makes the data harder to process in SQL. It is often done for display purposes, because humans find the pivoted form easier to read. From a software engineering perspective, however, such formatting should not be done in the data layer; be careful that by doing this you don't make your future life harder.

SQL: dealing with every bit without running the query repeatedly

I have a column that uses bits to record the status of every mission. The bit index represents the mission number, while 1/0 indicates whether that mission was successful; the bits are logically independent even though they are stored together.
For instance, 1010 (stored as a decimal) means a user finished the 2nd and 4th missions successfully, and the table looks like:
uid status
a 1100
b 1111
c 1001
d 0100
e 0011
Now I need to calculate, for every mission, how many users passed it. E.g. for mission 1 it's 0+1+1+0+1 = 3, while for mission 2 it's 0+1+0+0+1 = 2.
I can use the formula FLOOR(status%POWER(10,n)/POWER(10,n-1)) to get the bit for mission n for every user, but this means I need to run my query n times, and the status is now 64 bits long...
Is there any elegant way to do this in one query? Any help is appreciated.
The obvious approach is to normalise your data:
uid mission status
a 1 0
a 2 0
a 3 1
a 4 1
b 1 1
b 2 1
b 3 1
b 4 1
c 1 1
c 2 0
c 3 0
c 4 1
d 1 0
d 2 0
d 3 1
d 4 0
e 1 1
e 2 1
e 3 0
e 4 0
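With the data in this shape, the per-mission totals the question asks for reduce to a single aggregate (a sketch; the table name user_missions and its column names are assumptions):
-- How many users passed each mission
SELECT mission, SUM(status) AS users_passed
FROM user_missions
GROUP BY mission;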
Alternatively, you can store a bitwise integer (or just do what you're currently doing) and process the data in your application code (e.g. a bit of PHP)...
uid status
a 12
b 15
c 9
d 4
e 3
<?php
$input = 15; // value comes from a query
$missions = array(1,2,3,4); // not really necessary in this particular instance
for( $i=0; $i<4; $i++ ) {
    $intbit = pow(2,$i);
    if( $input & $intbit ) {
        echo $missions[$i] . ' ';
    }
}
?>
Outputs '1 2 3 4'
Just convert the value to a string, remove the '0's, and calculate the length. Assuming that the value really is a decimal:
select length(replace(cast(status as char), '0', '')) as num_missions
from t;
Here is a db<>fiddle using MySQL. Note that the conversion to a string might look a little different in Hive, but the idea is the same.
If it is stored as an integer, you can use the bin() function to convert an integer to a string. This is supported in both Hive and MySQL (the original tags on the question).
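For example, a sketch along the same lines as the query above, reusing its table t and integer-typed status:
-- bin() renders the integer as a binary string; then count the '1's
select uid, length(replace(bin(status), '0', '')) as num_missions
from t;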
Bit fiddling in databases is usually a bad idea and suggests a poor data model. Your data should have one row per user and mission. Attempts at optimizing by stuffing things into bits may work sometimes in some programming languages, but rarely in SQL.

PIG Script to handle nth-1 record

Input file structure: records are sorted based on the timestamp.
Expected input file size: 2-3 TB.
timestamp
==============
20141014120523
20141014120534
20141014120537
20141014120542
20141014120549
20141014120555
20141014120565
20141014120570
20141014120512
...
...
Using Pig, I need to find the time difference between the Nth record's and the (N-1)th record's timestamps (20141014120534 - 20141014120523 = 11 secs).
I need to loop through all the records to get the time difference from the previous record.
Example Output
0
11
3
5
...
Please help me with the right resources/references/solutions.
Can you try this?
input.txt
20141014120523
20141014120534
20141014120537
20141014120542
20141014120549
20141014120555
20141014120565
20141014120570
PigScript:
A = LOAD 'input.txt' USING PigStorage() AS (time:long);
-- Attach a 1-based rank to every row
B = RANK A;
-- Drop the first row; it has no previous record
C = FILTER B BY rank_A > 1;
-- Shift each rank down by one so a row pairs with its predecessor
D = FOREACH C GENERATE ($0 - 1), $1;
-- Join every row with the row that preceded it
E = JOIN B BY $0, D BY $0;
F = FOREACH E GENERATE (D::time - B::time);
DUMP F;
Output:
(11)
(3)
(5)
(7)
(6)
(10)
(5)

Apache Pig: Append two data sets into one

I have two data sets.
1st set A
(111)
(222)
(555)
2nd set B
(333)
(444)
(666)
I did
C = UNION A,B;
After appending the two data sets, the output should be the first data set followed by the second.
Expected output C is
(111)
(222)
(555)
(333)
(444)
(666)
But my output C is
(333)
(444)
(666)
(111)
(222)
(555)
If I apply UNION, the result is not in order, and it is difficult for me to append them in set order.
How can I do this?
I can't think of anything, but any help will be appreciated.
Add an extra column to each of the files giving the file number, then take the union of the modified data sets and sort on the file_number column:
A = LOAD 'A.txt' USING PigStorage() AS (a:int);
B = LOAD 'B.txt' USING PigStorage() AS (b:int);
A_mod = FOREACH A GENERATE a, 1 AS file_number;
B_mod = FOREACH B GENERATE b, 2 AS file_number;
unified_mod = UNION A_mod, B_mod;
-- ORDER ... BY is Pig's sorting operator (there is no SORT), and 'output' is a reserved word, so use a different alias
ordered = ORDER unified_mod BY file_number;
I've tried the classic UNION, and for me the data stays in order.
But let's try to force it if it doesn't :)
Well, as I said in the previous comment, it's not efficient, but it does the job.
--In order to determine nbA you can run the following cmd in the shell : wc -l A.txt
%default nbA 3
A = LOAD 'A.txt' USING PigStorage() AS (a:int);
B = LOAD 'B.txt' USING PigStorage() AS (b:int);
A = RANK A;
B = RANK B;
--DESCRIBE B;
B = FOREACH B GENERATE rank_B + $nbA, $1;
C = UNION B, A;
C = ORDER C BY $0;
C = FOREACH C GENERATE $1; -- If you want to drop the first column
DUMP C;
Output :
(111)
(222)
(555)
(333)
(444)
(666)
Where :
A.txt
111
222
555
And B.txt:
333
444
666

Selecting random tuple from bag

Is it possible to (efficiently) select a random tuple from a bag in pig?
I can just take the first result of a bag (as it is unordered), but in my case I need a proper random selection.
One (not efficient) solution is counting the number of tuples in the bag, take a random number within that range, loop through the bag, and stop whenever the number of iterations matches my random number. Does anyone know of faster/better ways to do this?
You could use RANDOM(), ORDER and LIMIT in a nested FOREACH statement to select one element with the smallest random number:
inpt = load 'group.txt' as (id:int, c1:bytearray, c2:bytearray);
groups = group inpt by id;
randoms = foreach groups {
    rnds = foreach inpt generate *, RANDOM() as rnd; -- assign random number to each row in the bag
    ordered_rnds = order rnds by rnd;
    one_tuple = limit ordered_rnds 1; -- select tuple with the smallest random number
    generate group as id, one_tuple;
};
dump randoms;
INPUT:
1 a r
1 a t
1 b r
1 b 4
1 e 4
1 h 4
1 k t
2 k k
2 j j
3 a r
3 e l
3 j l
4 a r
4 b t
4 b g
4 h b
4 j d
5 h k
OUTPUT:
(1,{(1,b,r,0.05172709255901231)})
(2,{(2,k,k,0.14351660053632986)})
(3,{(3,e,l,0.0854104195792681)})
(4,{(4,h,b,8.906013598960483E-4)})
(5,{(5,h,k,0.6219490873384448)})
If you run "dump randoms;" multiple times, you should get different results for each run.
Writing a UDF might give you better performance, as you would not need to do a secondary sort on the random number within the bag.
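As a rough sketch of that UDF idea (the module and function names are hypothetical, and the output schema is an assumption matching the sample data), a Jython UDF can pick a uniformly random tuple in a single pass with reservoir sampling, so no sort is needed:
myrandom.py
import random

@outputSchema('t:tuple(id:int, c1:bytearray, c2:bytearray)')
def choose_random(bag):
    # Reservoir sampling with k = 1: the i-th tuple (0-based) replaces the
    # current choice with probability 1/(i+1), which leaves every tuple
    # selected with equal probability after one pass over the bag.
    chosen = None
    for i, t in enumerate(bag):
        if random.randint(0, i) == 0:
            chosen = t
    return chosen
You would then register it and call it per group, e.g. randoms = foreach groups generate group as id, myrandom.choose_random(inpt);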
I needed to do this myself, and surprisingly found that a very simple answer seems to work to get about 10% of an alias A:
B = FILTER A BY RANDOM() < 0.1;
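For what it's worth, Pig also has a built-in SAMPLE operator that does the same thing:
B = SAMPLE A 0.1;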