PIG REPLACE with NULL - apache-pig

I have three values A, B and C.
I want to be able to replace the value of C with a NULL value if A AND B have values in their cells.
I'm unsure how to go about it. I've tried something like
FOREACH X GENERATE REPLACE(C, ((A IS NOT NULL AND B IS NOT NULL) ? NULL : C));
but I'm unsure whether this will work; it doesn't seem right. I don't want to add any more fields, just update the value of C.
Maybe something like
FOREACH X GENERATE ((A IS NOT NULL AND B IS NOT NULL) ? NULL : C) AS NEW_C;
Then drop C, whilst retaining A, B and NEW_C?

You can simply do:
Y = FOREACH X GENERATE A, B, (A IS NOT NULL AND B IS NOT NULL ? NULL : C) AS C;
There is no need to create NEW_C and then drop C: FOREACH ... GENERATE carries only the fields you explicitly name into the new relation (or all of them, if you use GENERATE *). Listing A, B and the new expression aliased AS C is enough.
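To make the row-by-row effect of that bincond concrete, here is a minimal sketch in plain Python rather than Pig. The sample rows and the function name `null_c_when_a_and_b_set` are made up for illustration; `None` stands in for Pig's NULL.

```python
# Plain-Python model of the bincond in the FOREACH above.
# The sample rows are hypothetical; None plays the role of Pig's NULL.
rows = [
    {"A": 1, "B": 2, "C": 3},        # A and B both set -> C becomes None
    {"A": 1, "B": None, "C": 3},     # B is null -> C is kept
    {"A": None, "B": None, "C": 3},  # both null -> C is kept
]

def null_c_when_a_and_b_set(row):
    """Build a new row with A, B and a recomputed C, like the GENERATE does."""
    both_set = row["A"] is not None and row["B"] is not None
    return {"A": row["A"], "B": row["B"], "C": None if both_set else row["C"]}

result = [null_c_when_a_and_b_set(r) for r in rows]
for r in result:
    print(r)
```

Note that the input rows are left untouched: as in Pig, the result is a new relation, not an in-place update.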

Related

Combining two tuples into one in oracle db

Say I have a bunch of letters grouped together and I want to find out which pair occurs together most often. As an example, I have
a b d b
b c e a
and it should return ab or ba, because the a-b pair occurs most often here.
So far I have made a query that pulls every two letters that are together, but when I run it I get something like the above example with every letter in a separate tuple, like this:
a
b
b
c
d
e
b
a
I need to compare the occurrence counts of all the PAIRS. My logic so far is that I can use nvl() to concatenate them (but nvl() of a-b and b-a returns two separate instances) and then run a count, but I'm not sure how to run a count on these, as I select the letters like this:
select a.value, b.value
from Letter a, Letter b, Word aw, Word bw, Sentence senA, sentence senb
where a.id = aw.aid and aw.pid = sena.id and b.id = bw.aid and
bw.pid = senb.id and aw.pid = bw.pid and a.value != b.value
;
TL;DR: I want to do a count(a.ltr-b.ltr pair) but not sure how to code that.
Thanks!
EDIT: table structure:
letter(id, value)
\
word(aid, pid)
\
sentence(id, name,sid)
Basically, if two letters end up in the same sentence.id, they are a pair (bond).
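One possible way to count unordered pairs, sketched here with Python's built-in sqlite3 module rather than Oracle: requiring a.value < b.value in the self-join means a-b and b-a collapse into a single row per sentence, so a plain GROUP BY can count them. The sample data is made up, and the sentence table is left out on the assumption that word.pid alone identifies the sentence.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE letter (id INTEGER PRIMARY KEY, value TEXT);
    CREATE TABLE word (aid INTEGER, pid INTEGER);  -- letter id, sentence id
    INSERT INTO letter VALUES (1,'a'),(2,'b'),(3,'c'),(4,'d'),(5,'e');
    -- Hypothetical sentences: 1={a,b}, 2={b,c}, 3={d,e}, 4={b,a}
    INSERT INTO word VALUES (1,1),(2,1),(2,2),(3,2),(4,3),(5,3),(2,4),(1,4);
""")

# a.value < b.value keeps each unordered pair once per sentence,
# so (a, b) and (b, a) are counted as the same pair.
pair_counts = conn.execute("""
    SELECT a.value, b.value, COUNT(*) AS n
    FROM word aw
    JOIN word bw ON aw.pid = bw.pid
    JOIN letter a ON a.id = aw.aid
    JOIN letter b ON b.id = bw.aid
    WHERE a.value < b.value
    GROUP BY a.value, b.value
    ORDER BY n DESC, a.value, b.value
""").fetchall()
print(pair_counts)
```

The first row of the result is the most frequent pair. The same join and GROUP BY should carry over to Oracle, though that has not been checked here.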

how to match the two load statements in pig

I have two load statements A and B.
Each one has a surrogate key. I want to match the surrogate key columns and, if both keys match, store the data.
I tried the following code.
A = LOAD 'a/data/' using PigStorage('\t') as (SourceWebSite:chararray,PropertyID:chararray,ListedOn:chararray,ContactName:chararray,TotalViews:int,Price:chararray);
B = LOAD 'b/data/' using PigStorage('\t') as (SourceWebsite:chararray,PropertyType:chararray,IPLSNO:int,Locality:chararray,City:chararray,Price:chararray);
C = COGROUP A BY Price, B BY Price;
D = FOREACH C GENERATE FLATTEN((IsEmpty(A) ? null : A)), FLATTEN((IsEmpty(B) ? null : B));
The above command prints all the data.
If I understand correctly, you want only those rows where both A and B have data for a given price, am I right?
Then you may have to use a filter:
D = FILTER C BY (NOT IsEmpty(A) AND NOT IsEmpty(B));
D will contain only those groups where both A and B have at least one row for the price used to group.
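The effect of COGROUP followed by that FILTER can be modelled in plain Python. This is only a sketch of the semantics: the `cogroup` helper and the trimmed-down sample rows are hypothetical, not part of Pig.

```python
from collections import defaultdict

# Hypothetical, trimmed-down rows: (SourceWebSite, Price)
A = [("siteA", "100"), ("siteA", "200")]
B = [("siteB", "100"), ("siteB", "300")]

def cogroup(left, right, key):
    """Model COGROUP: one entry per key, holding a bag from each input."""
    groups = defaultdict(lambda: ([], []))
    for row in left:
        groups[key(row)][0].append(row)
    for row in right:
        groups[key(row)][1].append(row)
    return dict(groups)

C = cogroup(A, B, key=lambda row: row[1])

# Model of: D = FILTER C BY (NOT IsEmpty(A) AND NOT IsEmpty(B));
D = {price: bags for price, bags in C.items() if bags[0] and bags[1]}
print(D)
```

Before the filter, C has a group for every price seen in either input, with an empty bag on the side that has no rows. After the filter, only prices present in both inputs remain, which is the behaviour of an inner join on Price.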

How to convert a relation of one form into another in Pig

I have a relation B of the form {A::id:int, A::date:chararray}.
C is of the form {id:int, date:chararray}.
I want to convert B into C, i.e. the new schema should be id:int, date:chararray.
Is there a way to do that?
Do this:
C2 = foreach B generate
A::id as id,
A::date as date;
The schemas of B and C already have the same types: both have an int followed by a chararray. If you want to rename the fields, you can do the projection suggested by user2303197.

How do I add a column, preserving the existing columns, without listing them all?

I want to add a new column to an alias, preserving all the existing ones.
A = foreach A generate
A.id as id,
A.date as date,
A.foo as foo,
A.bar as bar,
A.foo / A.bar as foobar;
Can I do that without listing all of them explicitly?
Yes, let's say you have an alias like:
A: {num1:int, num2:int}
and you want to calculate the sum while keeping num1 and num2. You can do this like:
B = FOREACH A GENERATE *, num1 + num2 AS num3:int ;
DESCRIBE B;
B: {num1:int, num2:int, num3:int}
Used like this, the * operator generates all fields.
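As a rough plain-Python analogy (not Pig), `GENERATE *, expr AS name` behaves like copying every existing field of a row and appending one computed field. The sample rows are made up.

```python
# Hypothetical rows for A: {num1:int, num2:int}
A = [{"num1": 1, "num2": 2}, {"num1": 10, "num2": 20}]

# Model of: B = FOREACH A GENERATE *, num1 + num2 AS num3;
# {**row, ...} copies all existing fields, then adds the new one at the end.
B = [{**row, "num3": row["num1"] + row["num2"]} for row in A]
print(B)
```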

How to find the occurrences of a column mapped to a corresponding column in a query SQL

I have a query as below
select custref, tetranumber
from
(select *
from cdsheader h, custandaddr c
where h.custref=c.cwdocid and c.addresstype = 'C' )
where tetranumber = '034096'
The objective is that each value in the 2nd column should correspond to only one value in the 1st column.
E.g. 034096 should always have 2600135 as the first column.
I would like to check whether there is any value other than 2600135 for 034096.
(I am a Java developer and suggested a solution to avoid 1-to-n or n-to-n mappings of data, but there is already bad data in the DB (Oracle), so I would like to find the bad data so that I can delete it.)
re: The objective is the 2nd column should have only one corresponding 1st column
You'll need to use an aggregate function, like MAX or MIN, to determine which of the rows is returned.
Thanks for the response guys,
I have figured out the way and here it goes...
select custref, count(distinct(tetranumber))
from (select custref, tetranumber
      from cdsheader h, custandaddr c
      where h.custref = c.cwdocid and c.addresstype = 'C')
group by custref
having count(distinct(tetranumber)) > 1
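The query above groups by custref. As I read the original objective (one custref per tetranumber), the check in the other direction may also be useful: group by tetranumber and count distinct custref values. Below is a minimal sketch using Python's built-in sqlite3 module instead of Oracle. The sample rows are made up, and placing tetranumber on cdsheader is an assumption, since the question does not say which table holds it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Assumed layout: tetranumber is taken to live on cdsheader.
    CREATE TABLE cdsheader (custref TEXT, tetranumber TEXT);
    CREATE TABLE custandaddr (cwdocid TEXT, addresstype TEXT);
    INSERT INTO cdsheader VALUES
        ('2600135', '034096'),
        ('2600999', '034096'),   -- made-up bad row: second custref for 034096
        ('2600200', '034100');
    INSERT INTO custandaddr VALUES
        ('2600135', 'C'), ('2600999', 'C'), ('2600200', 'C');
""")

# Tetranumbers that map to more than one distinct custref.
duplicates = conn.execute("""
    SELECT h.tetranumber, COUNT(DISTINCT h.custref) AS custrefs
    FROM cdsheader h
    JOIN custandaddr c ON h.custref = c.cwdocid
    WHERE c.addresstype = 'C'
    GROUP BY h.tetranumber
    HAVING COUNT(DISTINCT h.custref) > 1
""").fetchall()
print(duplicates)
```

Each returned row is a tetranumber with more than one custref, which are the candidates for cleanup.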