I would like to transform a map into a field in a Pig Latin script

The schema of my relation (A) is as follows:
{a: int, b: int, c: map[]}
The map contains only one chararray value, but its key is not predictable. For example, a sample of my tuples is:
(1, 100, [key.152#hello])
(8, 110, [key.3000#bonjour])
(5, 103, [key.1#hallo])
(5, 103, [])
(8, 104, [key.11#buenosdias])
...
I would like to transform my relation (A) into a relation B whose schema would be:
{a: int, b: int, c: chararray}
With my sample, it would give:
(1, 100, hello)
(8, 110, bonjour)
(5, 103, hallo)
(8, 104, buenosdias)
...
(I also want to filter out the tuples with empty maps.)
Any ideas?
Thank you.

Writing a UDF is the right solution, but if you want a quick hack, the following approach may help. It loads the map column as a plain chararray and cuts the value out with STRSPLIT, which splits on a regular expression.
A = LOAD 'sample.txt' as (a:int, b:int, c:chararray);
B = FOREACH A GENERATE a, b, FLATTEN(STRSPLIT(c, '#', 2)) as (key:chararray, value:chararray);
C = FOREACH B GENERATE a, b, FLATTEN(STRSPLIT(value, ']', 2)) as (value:chararray, ignore:chararray);
D = FILTER C BY value is not null;
E = FOREACH D GENERATE a, b, value;
STORE E INTO 'output/E';
For the sample input (tab-separated, which is the default delimiter for LOAD):
1 100 [key.152#hello]
8 110 [key.3000#bonjour]
5 103 [key.1#hallo]
5 103 []
8 104 [key.11#buenosdias]
The above code produces the following output:
1 100 hello
8 110 bonjour
5 103 hallo
8 104 buenosdias
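For the UDF route mentioned at the start of this answer, a minimal sketch of a Python (Jython) UDF might look like the following. It assumes that Pig hands the map to the function as a dict and that the `outputSchema` decorator is supplied by Pig when the file is registered with jython. The function name `single_value` and the fallback shim are made up for illustration.

```python
# Sketch of a Jython UDF for Pig. Assumption: Pig provides the
# `outputSchema` decorator when this file is registered with jython.
# The shim below only exists so the file can also be run and tested
# as plain Python.
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        def wrap(func):
            return func
        return wrap


@outputSchema("c:chararray")
def single_value(m):
    """Return the only value of a one-entry map, or None for an empty map."""
    if not m:
        return None
    for value in m.values():
        return value
```

If the file were saved as `map_udf.py` (a hypothetical name), it could be registered with `REGISTER 'map_udf.py' USING jython AS m;`, called as `m.single_value(c)` inside a FOREACH, and followed by a FILTER on `IS NOT NULL` to drop the empty maps.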

Related

How to find the row and column number of a specific cell in sql?

I have a table in SQL database and I want to find the location of a cell like a coordinate and vice versa. Here is an example:
0 1 2 3
1 a b c
2 g h i
3 n o j
When I ask for i, I want to get row=2 and column=3. When I ask for a cell of row=2 and column=3, I want to get i.
You need to store your matrix in a table that records the row and column of each value explicitly. (ROW and COLUMN are reserved words in Oracle, so the columns are named row_num and col_num here.)
create table matrix (
  row_num int,
  col_num int,
  value varchar2(20)
);
Then you insert your data like this:
insert into matrix values (1, 1, 'a');
insert into matrix values (1, 2, 'b');
-- and so on.
Then you can find what you need with two queries:
select row_num, col_num from matrix where value = 'i';
select value from matrix where row_num = 2 and col_num = 3;
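The two lookups in this answer can be tried end to end with a small sketch. It uses Python's built-in `sqlite3` module with an in-memory database rather than Oracle, and the names `row_num`, `col_num`, `find_position` and `find_value` are illustrative choices, not part of the original answer.

```python
import sqlite3

# In-memory database; row_num/col_num are illustrative names chosen to
# avoid the reserved words ROW and COLUMN.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE matrix (row_num INTEGER, col_num INTEGER, value TEXT)"
)

# The 3x3 grid from the question, one string per row.
grid = ["abc", "ghi", "noj"]
cells = [
    (r, c, ch)
    for r, line in enumerate(grid, start=1)
    for c, ch in enumerate(line, start=1)
]
conn.executemany("INSERT INTO matrix VALUES (?, ?, ?)", cells)


def find_position(value):
    """Return (row, column) of a cell holding value, or None if absent."""
    return conn.execute(
        "SELECT row_num, col_num FROM matrix WHERE value = ?", (value,)
    ).fetchone()


def find_value(row, col):
    """Return the value stored at (row, col), or None if the cell is absent."""
    hit = conn.execute(
        "SELECT value FROM matrix WHERE row_num = ? AND col_num = ?",
        (row, col),
    ).fetchone()
    return hit[0] if hit else None


print(find_position("i"))  # (2, 3)
print(find_value(2, 3))    # i
```

Storing one cell per record keeps both directions of the lookup to a single indexed-friendly WHERE clause, at the cost of one insert per cell.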
In Oracle, you would do:
select "3"
from t
where "0" = 2;
Naming columns as numbers is not recommended. Your whole data model is strange for SQL. A better representation would be:
row col val
1 1 a
1 2 b
1 3 c
2 1 g
. . .
Then you could do:
select val
from grid
where row = 2 and col = 3;
Create a primary key column such as 'id' to identify the row, and select the column you want, for example 'col':
select col from db where id = 2;
This returns the cell in column 'col' of the row whose id is 2.

Using Pig conditional operator to implement or?

Let's say I have some table f, consisting of the following columns:
a, b
0, 1
0, 0
0, 0
0, 1
1, 0
1, 1
I want to create a new column, c, that is equal to a | b.
I've tried the following:
f = foreach f generate a, b, ((a or b) == 1) ? 1 : 0 as c;
but receive the following error:
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: Pig script failed to parse: NoViableAltException(91#[])
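The OR condition is not constructed correctly: the operands of or must be boolean expressions, so compare each column with 1 first, and wrap the whole conditional in parentheses. Can you try this?
f = foreach f generate a, b, (((a==1) or (b==1))?1:0) AS c;
Example: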
input:
0,1
0,0
0,0
0,1
1,0
1,1
PigScript:
A = LOAD 'input' USING PigStorage(',') AS (a:int,b:int);
B = foreach A generate a, b, (((a==1) or (b==1))?1:0) AS c;
DUMP B;
Output:
(0,1,1)
(0,0,0)
(0,0,0)
(0,1,1)
(1,0,1)
(1,1,1)
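The same logic can be cross-checked outside Pig with a few lines of Python. This is only a sketch that mirrors the conditional above; the helper name `or_flag` is made up for illustration.

```python
def or_flag(a, b):
    """Return 1 when either input equals 1, otherwise 0."""
    return 1 if (a == 1 or b == 1) else 0


# The six input rows from the example above.
rows = [(0, 1), (0, 0), (0, 0), (0, 1), (1, 0), (1, 1)]
result = [(a, b, or_flag(a, b)) for a, b in rows]

for row in result:
    print(row)
```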

Looping through element in Pig to generate a new tuple for relation

Say I have a relation as follows:
(A, (1, 2, 3))
(B, (2, 3))
Is it possible to make a new relation by expanding the bag element as follows, using Pig Latin?
(A, 1)
(A, 2)
(A, 3)
(B, 2)
(B, 3)
I tried using FOREACH and GENERATE, but I am having difficulty generating a new tuple while looping through a bag element.
Thanks,
-------------
EDIT
-------------
Here's a sample input:
A 1 2 3
B 2 3
The title is separated from the links by a tab, and the links are separated from each other by a space.
I used STRSPLIT on the whitespace to generate a tuple:
raw_x = LOAD './sample.txt' using PigStorage('\t') AS (title:chararray, links:chararray);
data_x = FOREACH raw_x GENERATE title, STRSPLIT(links, '\\s+') AS links;
Can you try this?
input.txt
A 1 2 3
B 2 3
PigScript:
A = LOAD 'input.txt' USING PigStorage() AS (title:chararray,links:chararray);
B = FOREACH A GENERATE title,FLATTEN(TOKENIZE(links));
DUMP B;
Output:
(A,1)
(A,2)
(A,3)
(B,2)
(B,3)
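What FLATTEN(TOKENIZE(links)) does to each row can be sketched in plain Python. This is an approximation for illustration only: it splits on whitespace, whereas TOKENIZE also treats a few other characters as delimiters by default, and the function name `flatten_links` is hypothetical.

```python
def flatten_links(rows):
    """Yield one (title, link) pair per whitespace-separated token in links."""
    for title, links in rows:
        for link in links.split():
            yield (title, link)


# The two input rows, already split on the tab into (title, links).
rows = [("A", "1 2 3"), ("B", "2 3")]
pairs = list(flatten_links(rows))

for pair in pairs:
    print(pair)
```

As in the Pig output, each token becomes its own row and the title is repeated next to it; the tokens stay strings unless they are cast.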

Pick a random value from a bag

I have grouped data in the relation B in the format
1, {(1,abc), (1,def)}
2, {(2,ghi), (2,mno), (2,pqr)}
Now I want to pick a random value from each bag, so that the output looks like
1, abc
2, mno
(in this case the first tuple was picked for group 1 and the second for group 2).
The issue is that I only have the grouped data B:
DESCRIBE B
B: {group: int,A: {(id: int,min: chararray,fan: chararray,max: chararray)}}
If I try to flatten it with
C = FOREACH B GENERATE FLATTEN($1);
DESCRIBE C;
C: {A::id: int,A::min: chararray,A::fan: chararray,A::max: chararray}
and then try
rand = FOREACH B {
    shuf_ = FOREACH C GENERATE RANDOM() AS r, *;  -- line L
    shuf = ORDER shuf_ BY r;
    pick1 = LIMIT shuf 1;
    GENERATE group, FLATTEN(pick1);
};
I get this error at line L: "Pig script failed to parse: expression is not a project expression: (Name: ScalarExpression) Type: null Uid: null)"
You can't refer to C inside a FOREACH on B, because C is a separate relation built from B. Inside the nested block you can only project fields of B, so you need to use the bag that B was built from, i.e. A.
Looking at your DESCRIBE output,
B: {group: int,A: {(id: int,min: chararray,fan: chararray,max: chararray)}}
the bag is named A, so replacing C with A at line L should work.
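The idea behind the nested block (attach a random number to every tuple in the bag, order by it, keep the first) can be sketched in Python as follows. The function name `pick_one_per_group` and the dict-of-lists layout are assumptions made for the illustration, not Pig structures.

```python
import random


def pick_one_per_group(grouped, rng=random):
    """Pick one random tuple per group by sorting on a random key.

    grouped maps a group key to a list of tuples (the "bag").
    Empty bags are skipped.
    """
    picks = {}
    for key, bag in grouped.items():
        if not bag:
            continue
        # Same steps as the Pig block: random key, ORDER BY it, LIMIT 1.
        shuffled = sorted(bag, key=lambda _tuple: rng.random())
        picks[key] = shuffled[0]
    return picks


grouped = {
    1: [(1, "abc"), (1, "def")],
    2: [(2, "ghi"), (2, "mno"), (2, "pqr")],
}
picks = pick_one_per_group(grouped, rng=random.Random(0))
print(picks)
```

In plain Python, `rng.choice(bag)` would do the same job directly; the sort is kept here only to mirror the ORDER and LIMIT steps.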

MySQL String Comparison with Percent Output

I am trying to compare two entries of 6 digits, each of which can be either 0 or 1 (e.g. 100001 or 011101). If 3 out of 6 match, I want the output to be .5. If 2 out of 6 match, I want the output to be .33, etc.
Here are the SQL commands to create the table
CREATE TABLE sim
(sim_key int,
string int);
INSERT INTO sim (sim_key, string)
VALUES (1, 111000);
INSERT INTO sim (sim_key, string)
VALUES (2, 111111);
My desired output comes from comparing the two strings, which share 50% of their characters, so the result should be 50%.
Is it possible to do this sort of comparison in SQL? Thanks in advance.
This returns the percentage of equal 1 bits in both strings:
select bit_count(conv(a.string, 2, 10) & conv(b.string, 2, 10))/6*100 as percent_match
from sim a, sim b where
a.sim_key=1 and b.sim_key=2;
As your bitfields are stored as numbers whose decimal digits are the bits, we first need to convert them: conv(a.string, 2, 10), conv(b.string, 2, 10).
Then we keep only bits that are 1 in each field: conv(a.string, 2, 10) & conv(b.string, 2, 10)
And we count them: bit_count(conv(a.string, 2, 10) & conv(b.string, 2, 10))
And finally we just compute the percentage: bit_count(conv(a.string, 2, 10) & conv(b.string, 2, 10)) / 6 * 100.
The query returns 50 for 111000 and 111111.
Here is another version that also counts matching zeros:
select bit_count((conv(a.string, 2, 10) & conv(b.string, 2, 10)) | ((0xFFFFFFFF>>(32-6))&~(conv(a.string, 2, 10)|conv(b.string, 2, 10))))/6*100 as percent_match
from sim a, sim b where
a.sim_key=1 and b.sim_key=2;
Note that, while this solution works, you should really store this field like this instead:
INSERT INTO sim (sim_key, string)
VALUES (1, conv("111000", 2, 10));
INSERT INTO sim (sim_key, string)
VALUES (2, conv("111111", 2, 10));
Or to update existing data:
UPDATE sim SET string=conv(string, 2, 10);
Then this query gives the same results (if you updated your data as described above):
select bit_count(a.string & b.string)/6*100 as percent_match
from sim a, sim b where
a.sim_key=1 and b.sim_key=2;
And to count zeros too:
select bit_count((a.string & b.string) | ((0xFFFFFFFF>>(32-6))&~(a.string|b.string)))/6*100 as percent_match
from sim a, sim b where
a.sim_key=1 and b.sim_key=2;
(replace 6s by the size of your bitfields)
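The bit arithmetic used in these queries can be checked with a short Python sketch. It is not MySQL code: `int(s, 2)` stands in for conv(s, 2, 10), `popcount` stands in for bit_count, and all function names are made up for the illustration.

```python
def popcount(x):
    """Count the 1 bits of a non-negative integer (like MySQL bit_count)."""
    return bin(x).count("1")


def percent_matching_ones(a_bits, b_bits, width):
    """Percentage of positions where both bit strings have a 1."""
    a, b = int(a_bits, 2), int(b_bits, 2)
    return popcount(a & b) * 100.0 / width


def percent_matching_positions(a_bits, b_bits, width):
    """Percentage of positions where the two bit strings agree (1s and 0s)."""
    a, b = int(a_bits, 2), int(b_bits, 2)
    mask = (1 << width) - 1        # width low bits set, e.g. 0b111111
    both_one = a & b               # positions where both are 1
    both_zero = mask & ~(a | b)    # positions where both are 0
    return popcount(both_one | both_zero) * 100.0 / width


print(percent_matching_ones("111000", "111111", 6))       # 50.0
print(percent_matching_positions("111000", "111111", 6))  # 50.0
```

The mask plays the role of (0xFFFFFFFF>>(32-6)) in the query: it discards the high bits that the bitwise NOT would otherwise turn on.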
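If the values are stored as actual bit values rather than as decimal digits, you can do this. Note that it divides by the number of bits set in either value rather than by the field width:
SELECT BIT_COUNT(s1.string & s2.string) / BIT_COUNT(s1.string | s2.string)
FROM sim s1, sim s2
WHERE s1.sim_key = 1 AND s2.sim_key = 2;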