I am new to Pig and am trying to understand the basic commands. I have a data set A which I inner joined to data set B. I want to keep only some of the variables in the resulting data set. How do I do that? This is what I have so far:
A = LOAD 'science_scores';
B = LOAD 'math_scores';
AB = JOIN A BY Name, B BY Student_Name;
Now both A and B have a lot of other columns that I don't need. In SQL I would do something like this:
SELECT A.science_score, B.math_score
FROM A
INNER JOIN B
ON A.Name = B.Student_Name
Can someone please help me figure how to do this?
Thanks!
You are looking for the FOREACH and GENERATE keywords.
selected = FOREACH AB GENERATE science_score, math_score;
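To make the two steps concrete, here is a rough Python model of what the join and the projection do. It is only a sketch: the sample rows and the extra columns (`grade`, `teacher`) are made up for illustration, and plain Python lists stand in for Pig relations.

```python
# Hypothetical sample rows; only Name, Student_Name and the two score
# columns come from the question, the rest is invented for illustration.
science = [
    {"Name": "ann", "science_score": 90, "grade": 7},
    {"Name": "bob", "science_score": 75, "grade": 8},
]
math = [
    {"Student_Name": "ann", "math_score": 88, "teacher": "lee"},
    {"Student_Name": "cid", "math_score": 60, "teacher": "kim"},
]

# Inner join on Name = Student_Name (the role of JOIN ... BY).
# Every column from both sides is carried into the joined row.
joined = [
    {**a, **b}
    for a in science
    for b in math
    if a["Name"] == b["Student_Name"]
]

# Projection: keep only the wanted columns (the role of FOREACH ... GENERATE,
# or of the SELECT column list in SQL).
selected = [(row["science_score"], row["math_score"]) for row in joined]
print(selected)
```

Only "ann" appears on both sides, so the joined relation has one row with all six columns, and the projection trims it down to the two scores.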
A = LOAD 'science_scores';
B = LOAD 'math_scores';
AB = JOIN A BY Name, B BY Student_Name;
dump AB;
Please refer to the question linked below:
How can I do this inner join properly in Apache PIG?
I am new to building SQL queries and could use some help. I built a query that works fine as a standalone query. The problem is that I need to use it in a report through the ExecuteScalar function, where nested queries are not allowed. I tried to rebuild it using joins, but I seem to be lost.
Can anyone help me "un-nest" this query?
SELECT
StockType2Job.Loaded
FROM
StockType2Job
WHERE
StockType2Job.IdStockType =
(SELECT StockType.IdStockType
FROM StockType
WHERE StockType.Number = '1001716.00')
AND
StockType2Job.IdStockType2JobGroup =
(SELECT StockType2JobGroup.IdStockType2JobGroup
FROM StockType2JobGroup
WHERE StockType2JobGroup.IdJob =
(SELECT Job.IdJob
FROM Job
WHERE Job.Number = '18-0085.02'
AND StockType2JobGroup.Caption = 'Breakout Room 1'))
Any help appreciated. Thanks
This query should work (on Oracle DB). Note that once a table has an alias, its columns must be referenced through that alias:
SELECT
a.Loaded
FROM
(((StockType2Job a JOIN StockType b ON a.IdStockType=b.IdStockType)
JOIN StockType2JobGroup c ON a.IdStockType2JobGroup=c.IdStockType2JobGroup)
JOIN Job d ON c.IdJob=d.IdJob)
WHERE
b.Number = '1001716.00' AND
d.Number = '18-0085.02' AND
c.Caption = 'Breakout Room 1'
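One way to check that the flattened version returns the same rows as the nested one is to run both against a small test database. The sketch below uses Python's built-in sqlite3 module rather than Oracle, and the table contents are invented for illustration; only the table names, column names and literal values come from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE StockType (IdStockType INTEGER, Number TEXT);
CREATE TABLE Job (IdJob INTEGER, Number TEXT);
CREATE TABLE StockType2JobGroup (IdStockType2JobGroup INTEGER, IdJob INTEGER, Caption TEXT);
CREATE TABLE StockType2Job (IdStockType INTEGER, IdStockType2JobGroup INTEGER, Loaded INTEGER);

-- Hypothetical rows: one full match plus three near misses.
INSERT INTO StockType VALUES (1, '1001716.00'), (2, '9999999.00');
INSERT INTO Job VALUES (10, '18-0085.02'), (11, '18-0001.01');
INSERT INTO StockType2JobGroup VALUES
    (100, 10, 'Breakout Room 1'), (101, 10, 'Main Hall'), (102, 11, 'Breakout Room 1');
INSERT INTO StockType2Job VALUES (1, 100, 5), (1, 101, 7), (2, 100, 9), (1, 102, 3);
""")

# The nested form from the question.
nested_sql = """
SELECT StockType2Job.Loaded
FROM StockType2Job
WHERE StockType2Job.IdStockType =
      (SELECT StockType.IdStockType FROM StockType
       WHERE StockType.Number = '1001716.00')
  AND StockType2Job.IdStockType2JobGroup =
      (SELECT StockType2JobGroup.IdStockType2JobGroup FROM StockType2JobGroup
       WHERE StockType2JobGroup.IdJob =
             (SELECT Job.IdJob FROM Job
              WHERE Job.Number = '18-0085.02'
                AND StockType2JobGroup.Caption = 'Breakout Room 1'))
"""

# The join form, with the selected column referenced through alias a.
joined_sql = """
SELECT a.Loaded
FROM StockType2Job a
JOIN StockType b ON a.IdStockType = b.IdStockType
JOIN StockType2JobGroup c ON a.IdStockType2JobGroup = c.IdStockType2JobGroup
JOIN Job d ON c.IdJob = d.IdJob
WHERE b.Number = '1001716.00'
  AND d.Number = '18-0085.02'
  AND c.Caption = 'Breakout Room 1'
"""

nested = conn.execute(nested_sql).fetchall()
joined = conn.execute(joined_sql).fetchall()
print(nested, joined)
```

With this sample data both queries return the single `Loaded` value of the fully matching row. One difference to keep in mind: if more than one row satisfies the conditions, the join form returns all of them, whereas a scalar subquery expects at most one value.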
I have the following two tables
activity
activity_bak
I want to take the comments from activity_bak and update the comments in activity to match by using the corresponding activity_no and activity_seq.
I've tried doing it like this, but without success:
update Animal.sysadm.activity
set activity_comment = ab.activity_comment
from Animal.SYSADM.activity a
left join Animal.SYSADM.activity_bak ab
on ab.activity_no = a.activity_no
left join Animal.sysadm.activity_bak ab2
on ab2.activity_seq = a.activity_seq
Any help or pointers would be greatly appreciated.
No need to do 2 joins, you need just one. The right syntax is:
UPDATE a
SET a.activity_comment = ab.activity_comment
FROM Animal.SYSADM.activity a
INNER JOIN Animal.SYSADM.activity_bak ab
ON ab.activity_no = a.activity_no
AND ab.activity_seq = a.activity_seq;
I think you want:
update A set activity_comment = ab.activity_comment
from Animal.SYSADM.activity a
left join Animal.SYSADM.activity_bak ab
on ab.activity_no = a.activity_no
And ab.activity_seq = a.activity_seq
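The two answers differ in what happens to rows of activity that have no counterpart in activity_bak: with an INNER JOIN they are left alone, with a LEFT JOIN they are updated too and their comment becomes NULL. The sketch below illustrates that difference in Python with sqlite3. It is not the SQL Server syntax from the answers: it swaps in a correlated subquery, which is a portable stand-in for `UPDATE ... FROM`, and the sample rows are invented.

```python
import sqlite3


def fresh_db():
    """Build a small in-memory database with hypothetical sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE activity (activity_no INTEGER, activity_seq INTEGER, activity_comment TEXT);
    CREATE TABLE activity_bak (activity_no INTEGER, activity_seq INTEGER, activity_comment TEXT);
    INSERT INTO activity VALUES (1, 1, 'old a'), (1, 2, 'old b'), (2, 1, 'keep me');
    INSERT INTO activity_bak VALUES (1, 1, 'backup a'), (1, 2, 'backup b');
    """)
    return conn


# Correlated lookup of the backup comment for the row being updated.
lookup = """(SELECT ab.activity_comment FROM activity_bak ab
             WHERE ab.activity_no = activity.activity_no
               AND ab.activity_seq = activity.activity_seq)"""

# INNER JOIN behaviour: only rows with a match in activity_bak are touched.
inner = fresh_db()
inner.execute(f"UPDATE activity SET activity_comment = {lookup} WHERE EXISTS {lookup}")

# LEFT JOIN behaviour: unmatched rows are updated as well and receive NULL.
left = fresh_db()
left.execute(f"UPDATE activity SET activity_comment = {lookup}")

query = "SELECT activity_comment FROM activity ORDER BY activity_no, activity_seq"
inner_result = [row[0] for row in inner.execute(query)]
left_result = [row[0] for row in left.execute(query)]
print(inner_result)
print(left_result)
```

In the second result the row (2, 1) loses its original comment, which is usually not what you want when restoring from a backup table that may be incomplete.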
I have performed an inner join on two tables. However, I am unable to perform a summation on one of the columns:
Queries performed:
sample1 = load '/user/tweets/samples.csv' using PigStorage AS (line:chararray);
words = FOREACH sample1 GENERATE FLATTEN(TOKENIZE(REPLACE(LOWER(TRIM(line)),'[\\p{Punct},\\p{Cntrl}]',''))) AS word;
newinnerjoin = join words by word, wordlexion by lexword;
Below is the output of the table: newinnerjoin
(important,important,2)
(irritated,irritated,-3)
(promoting,promoting,1)
(promoting,promoting,1)
(appreciate,appreciate,2)
(confidence,confidence,2)
I want to perform the aggregation on column 3 of the inner join results.
So, I would like the sum to be calculated as 2 + -3 + 1 + 1 + 2 + 2 = 5
Is there a way I can do this without storing the inner join results in a CSV file?
Please advise.
Thanks
Can you add the three lines of code below and let me know the result?
A = GROUP newinnerjoin ALL;
B = FOREACH A GENERATE SUM(newinnerjoin.$2);
DUMP B;
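For reference, this is what GROUP ... ALL followed by SUM amounts to: all rows are collected into a single group and the third field ($2) is added up. A small Python sketch using the six rows from the question:

```python
# The joined rows as shown in the question: (word, lexword, score).
rows = [
    ("important", "important", 2),
    ("irritated", "irritated", -3),
    ("promoting", "promoting", 1),
    ("promoting", "promoting", 1),
    ("appreciate", "appreciate", 2),
    ("confidence", "confidence", 2),
]

# GROUP ... ALL puts every row in one bag; SUM adds up the third field.
total = sum(score for _, _, score in rows)
print(total)  # 5
```

Nothing has to be written to disk for this; the aggregation runs directly on the joined relation.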
I would like to find the item each person bought most recently. Assume that the same person can buy many items.
Below are the input details:
kumar,2014-09-30,television
kumar,2014-07-27,smartphone
Andrew,2014-06-21,camera
Andrew,2014-05-20,car
I need the output as below:
kumar,2014-09-30,television
Andrew,2014-06-21,camera
I wrote a Pig script up to this point, but I don't know how to proceed from here. Can somebody help me?
A = LOAD 'records.txt' USING PigStorage(',') AS(name:chararray,date:chararray,item:chararray);
B = GROUP A BY name;
C = FOREACH B GENERATE group,MAX(A.date);
But I need to get the item that was purchased most recently by each person. How do I get that? After applying GROUP, it seems I can only use aggregate functions in Pig.
How do I get the respective item that was purchased?
Use bags and ORDER BY in a nested FOREACH; it uses only one MapReduce job and is more in the Apache Pig style.
A = LOAD 'input.txt' USING PigStorage(',') AS(name:chararray,date:chararray,item:chararray);
B = GROUP A BY name;
C = FOREACH B {
ordered = ORDER A BY date DESC; -- this will cause secondary sort to optimise the execution
latest = LIMIT ordered 1;
GENERATE FLATTEN(latest); -- advantage of Pig: all columns are preserved, not dropped as in an SQL GROUP BY
};
DUMP C;
Also, positional references such as $0 and $1 are convenient, but imagine a script with hundreds of lines and dozens of GROUP BY and JOIN operations that all project using '$': it is a nightmare to follow the flow of columns through such a script, and a huge amount of time is wasted maintaining and changing it.
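If it helps to see the logic outside Pig, here is a rough Python equivalent of the nested FOREACH: group by name, sort each group by date in descending order, and keep the first row. It is only a sketch of the idea, using the sample rows from the question, and it relies on the dates being ISO-formatted strings so that string order matches date order.

```python
from itertools import groupby
from operator import itemgetter

# Sample input from the question: (name, date, item).
records = [
    ("kumar", "2014-09-30", "television"),
    ("kumar", "2014-07-27", "smartphone"),
    ("Andrew", "2014-06-21", "camera"),
    ("Andrew", "2014-05-20", "car"),
]

# GROUP A BY name: itertools.groupby needs its input sorted by the key.
by_name = sorted(records, key=itemgetter(0))

latest = []
for name, bag in groupby(by_name, key=itemgetter(0)):
    # ORDER A BY date DESC, then LIMIT 1. The whole row survives,
    # so the item column is still available afterwards.
    ordered = sorted(bag, key=itemgetter(1), reverse=True)
    latest.append(ordered[0])

print(latest)
```

Because the full row is kept rather than just the maximum date, no second join is needed to recover the item.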
I hope this works for you.
input.txt
kumar,2014-09-30,television
kumar,2014-07-27,smartphone
Andrew,2014-06-21,camera
Andrew,2014-05-20,car
PigScript:
A = LOAD 'input.txt' USING PigStorage(',') AS(name:chararray,date:chararray,item:chararray);
B = GROUP A BY name;
C = FOREACH B GENERATE group,FLATTEN(MAX($1.date));
D = JOIN A BY (name, date), C BY ($0, $1); -- join on name as well, so equal dates of different people cannot match
E = FOREACH D GENERATE $0,$1,$2;
DUMP E;
Output:
(Andrew,2014-06-21,camera)
(kumar,2014-09-30,television)
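One thing to watch with the "compute MAX, then join back" approach is the join key. If the join uses the date alone, a purchase by one person can match another person's maximum date whenever the two dates happen to coincide. The Python sketch below shows the effect on slightly modified sample data; the shared date for Andrew is a hypothetical change made for illustration.

```python
# Sample rows, with Andrew's camera moved to the same date as kumar's
# older purchase (hypothetical change to trigger the problem).
records = [
    ("kumar", "2014-09-30", "television"),
    ("kumar", "2014-07-27", "smartphone"),
    ("Andrew", "2014-07-27", "camera"),
    ("Andrew", "2014-05-20", "car"),
]

# Latest date per person (the GROUP ... MAX step).
max_date = {}
for name, date, _ in records:
    max_date[name] = max(date, max_date.get(name, ""))

# Joining back on the date alone: any row whose date is somebody's maximum.
by_date_only = [r for r in records if r[1] in max_date.values()]

# Joining back on (name, date): only each person's own latest row.
by_name_and_date = [r for r in records if max_date[r[0]] == r[1]]

print(by_date_only)
print(by_name_and_date)
```

The date-only join lets kumar's smartphone through because 2014-07-27 is Andrew's latest date, while the join on both fields returns exactly one row per person.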
I'm working on some SQL queries to get data out of a table. I have written two queries for the same data, but they give different results. The two queries are:
SELECT Samples.Sample,
data_overview.Sample_Name,
data_overview.Sample_Group,
data_overview.NorTum,
data_overview.Sample_Plate,
data_overview.Sentrix_ID,
data_overview.Sentrix_Position,
data_overview.HybNR,
data_overview.Pool_ID
FROM tissue INNER JOIN (
( patient INNER JOIN data_overview
ON patient.Sample = data_overview.Sample)
INNER JOIN Samples ON
(data_overview.Sample_id = Samples.Sample_id) AND
(patient.Sample = Samples.Sample)
) ON
(tissue.Sample_Name = data_overview.Sample_Name) AND
(tissue.Sample_Name = patient.Sample_Name)
WHERE data_overview.Sentrix_ID= 1416198
OR data_overview.Pool_ID='GS0005701-OPA'
OR data_overview.Pool_ID='GS0005702-OPA'
OR data_overview.Pool_ID='GS0005703-OPA'
OR data_overview.Pool_ID='GS0005704-OPA'
OR data_overview.Sentrix_ID= 1280307
ORDER BY Samples.Sample;
And the other is
SELECT Samples.Sample,
data_overview.Sample_Name,
data_overview.Sample_Group,
data_overview.NorTum,
data_overview.Sample_Plate,
data_overview.Sentrix_ID,
data_overview.Sentrix_Position,
data_overview.HybNR,
data_overview.Pool_ID
FROM tissue INNER JOIN
(
(patient INNER JOIN data_overview
ON patient.Sample = data_overview.Sample)
INNER JOIN Samples ON
(data_overview.Sample_id = Samples.Sample_id)
AND (patient.Sample = Samples.Sample)) ON
(tissue.Sample_Name = data_overview.Sample_Name)
AND (tissue.Sample_Name = patient.Sample_Name)
WHERE ((
(data_overview.Sentrix_ID)=1280307)
AND (
(data_overview.Pool_ID)="GS0005701-OPA"
OR (data_overview.Pool_ID)="GS0005702-OPA"
OR (data_overview.Pool_ID)="GS0005703-OPA"
OR (data_overview.Pool_ID)="GS0005704-OPA"))
OR (((data_overview.Sentrix_ID)=1416198))
ORDER BY data_overview.Sample;
The first query works quite well, but it still doesn't filter on Sentrix_ID.
The second one was generated by Access, but when I try to run it in R it gives an "unexpected symbol" error.
If anyone knows how to write a query that filters on Pool_ID and Sentrix_ID with the given parameters, thanks in advance.
Is it a case of making the where clause something like this:
WHERE Sentrix_ID = 1280307 AND (Pool_ID = 'VAL1' OR Pool_ID = 'VAL2' OR Pool_ID = 'VAL3')
i.e. making sure you have brackets around the "OR" components?
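To see why the brackets matter, here is a small sketch in Python with sqlite3 that runs an all-OR filter and a grouped filter against the same rows. The table is cut down to three columns and the rows are invented for illustration; only the column names and the ID values come from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE data_overview (Sample TEXT, Sentrix_ID INTEGER, Pool_ID TEXT);
-- Hypothetical rows covering the interesting combinations.
INSERT INTO data_overview VALUES
    ('s1', 1280307, 'GS0005701-OPA'),  -- right Sentrix_ID, right pool
    ('s2', 1280307, 'OTHER-POOL'),     -- right Sentrix_ID, wrong pool
    ('s3', 5555555, 'GS0005701-OPA'),  -- wrong Sentrix_ID, right pool
    ('s4', 1416198, 'OTHER-POOL');     -- the second Sentrix_ID
""")

# Everything chained with OR: a row passes if ANY single condition holds.
all_or_sql = """
SELECT Sample FROM data_overview
WHERE Sentrix_ID = 1416198
   OR Pool_ID = 'GS0005701-OPA'
   OR Sentrix_ID = 1280307
ORDER BY Sample
"""

# Pool filter grouped in brackets and tied to one Sentrix_ID with AND.
grouped_sql = """
SELECT Sample FROM data_overview
WHERE (Sentrix_ID = 1280307
       AND (Pool_ID = 'GS0005701-OPA' OR Pool_ID = 'GS0005702-OPA'))
   OR Sentrix_ID = 1416198
ORDER BY Sample
"""

all_or = [row[0] for row in conn.execute(all_or_sql)]
grouped = [row[0] for row in conn.execute(grouped_sql)]
print(all_or)
print(grouped)
```

The all-OR version lets every row through, including s2 and s3, which is consistent with the first query appearing not to filter on Sentrix_ID. The grouped version keeps only s1 and s4. The string literals here use single quotes, which also avoids clashing with the double quotes that delimit a query string in the calling code.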
Maybe you meant:
...
WHERE data_overview.Sentrix_ID IN (1280307,1416198 )
AND data_overview.Pool_ID IN ('GS0005701-OPA', 'GS0005702-OPA', 'GS0005703-OPA', 'GS0005704-OPA')
;