I have performed an inner join on two tables. However, I am unable to perform a summation on one of the columns:
Queries performed:
sample1 = load '/user/tweets/samples.csv' using PigStorage AS (line:chararray);
words = FOREACH sample1 GENERATE FLATTEN(TOKENIZE(REPLACE(LOWER(TRIM(line)),'[\\p{Punct},\\p{Cntrl}]',''))) AS word;
newinnerjoin = join words by word, wordlexion by lexword;
Below is the output of the table: newinnerjoin
(important,important,2)
(irritated,irritated,-3)
(promoting,promoting,1)
(promoting,promoting,1)
(appreciate,appreciate,2)
(confidence,confidence,2)
I want to perform the aggregation on column 3 of the inner join results.
So, I would like the sum to be calculated as 2 + -3 + 1 + 1 + 2 + 2 = 5
Is there a way I can do this without storing the inner join results in a CSV file?
Please advise.
Thanks
Can you add the 3 lines of code below and let me know the result?
A = GROUP newinnerjoin ALL;
B = FOREACH A GENERATE SUM(newinnerjoin.$2);
DUMP B;
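To see what GROUP ... ALL followed by SUM computes, here is a minimal sketch of the same aggregation in Python with an in-memory SQLite table. The rows are copied from the join output in the question; the column names word, lexword and score are assumptions, since the question does not name the third column.

```python
import sqlite3

# Joined rows copied from the question's output.
# The column names (word, lexword, score) are assumed for illustration.
rows = [
    ("important", "important", 2),
    ("irritated", "irritated", -3),
    ("promoting", "promoting", 1),
    ("promoting", "promoting", 1),
    ("appreciate", "appreciate", 2),
    ("confidence", "confidence", 2),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE newinnerjoin (word TEXT, lexword TEXT, score INTEGER)")
conn.executemany("INSERT INTO newinnerjoin VALUES (?, ?, ?)", rows)

# With no GROUP BY, the aggregate runs over the whole relation,
# which is the role GROUP ... ALL plays in the Pig script.
total = conn.execute("SELECT SUM(score) FROM newinnerjoin").fetchone()[0]
print(total)  # prints 5
```

Nothing is written to disk here, which mirrors the point of the answer: the join result can be aggregated directly without first storing it in a file.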
I have a problem with my query: it returns duplicated rows, as follows:
![result query][1]
As you can see, the number of rows is multiplied by itself, even though the rows all share the same id.
Do you know how to avoid this?
[1]: https://i.stack.imgur.com/a4uS8.png
SELECT
distinct
tt.sourceId, tt.name,tt.date,
ts.content_plainText, ts.itemId,
ttk.content_Plaintext, ttk.ticketId
FROM ticket_tickets as tt
inner join ticket_ticketSolutions ts
on tt.sourceId = ts.itemId
inner join ticket_ticketTasks ttk
on ts.itemId = ttk.ticketId
Would it be possible to get a single row that covers the 3 matching results?
I have even tried GROUP BY on the id, but it doesn't work.
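One possible way to look at this: when a ticket has 3 solutions and 3 tasks, joining both child tables directly yields 3 x 3 = 9 rows, and DISTINCT cannot remove them because each combination is different. The sketch below reproduces that with SQLite from Python and then collapses each child table to one row per ticket before joining. The sample data is invented, and GROUP_CONCAT is the SQLite/MySQL spelling; other databases use a different string aggregate, so treat this as an illustration rather than a drop-in fix.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ticket_tickets (sourceId INTEGER, name TEXT);
CREATE TABLE ticket_ticketSolutions (itemId INTEGER, content_plainText TEXT);
CREATE TABLE ticket_ticketTasks (ticketId INTEGER, content_Plaintext TEXT);
-- Invented sample data: one ticket with three solutions and three tasks.
INSERT INTO ticket_tickets VALUES (1, 'printer broken');
INSERT INTO ticket_ticketSolutions VALUES (1, 's1'), (1, 's2'), (1, 's3');
INSERT INTO ticket_ticketTasks VALUES (1, 't1'), (1, 't2'), (1, 't3');
""")

# Joining both child tables directly pairs every solution with every task.
multiplied = conn.execute("""
    SELECT DISTINCT tt.sourceId, ts.content_plainText, ttk.content_Plaintext
    FROM ticket_tickets tt
    JOIN ticket_ticketSolutions ts ON tt.sourceId = ts.itemId
    JOIN ticket_ticketTasks ttk ON ts.itemId = ttk.ticketId
""").fetchall()

# Collapsing each child table to one row per ticket first avoids the blow-up.
collapsed = conn.execute("""
    SELECT tt.sourceId, tt.name, s.solutions, k.tasks
    FROM ticket_tickets tt
    JOIN (SELECT itemId, GROUP_CONCAT(content_plainText, ' | ') AS solutions
          FROM ticket_ticketSolutions GROUP BY itemId) s
      ON s.itemId = tt.sourceId
    JOIN (SELECT ticketId, GROUP_CONCAT(content_Plaintext, ' | ') AS tasks
          FROM ticket_ticketTasks GROUP BY ticketId) k
      ON k.ticketId = tt.sourceId
""").fetchall()

print(len(multiplied), len(collapsed))  # prints 9 1
```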
I have 4 tables and the following SQL query:
SELECT * FROM dbo.Synola, dbo.Stores, dbo.Fpa, dbo.Nomismata
WHERE
dbo.Stores.Store_id = dbo.Synola.Store_id
AND
dbo.Stores.fpa_id = dbo.Fpa.fpa_id
AND
dbo.Stores.nomisma_id = dbo.Nomismata.nomisma_id
The above works fine and without errors.
My problem arises when I try to loop over the results of the above query:
Currently, my Stores table contains only 2 stores, and I want the loop to return ONLY 2 records, one for each store. Unfortunately, I am receiving more than 2 records.
What is the correct syntax for my query so that the loop returns results only for my 2 stores?
This should only retrieve data that's available in ALL tables. Therefore, if there's no match for a row of the main Stores table, that row won't show in the result.
SELECT *
FROM Stores s
JOIN Synola sy ON sy.Store_id = s.Store_id
JOIN Fpa f ON f.fpa_id = s.fpa_id
JOIN Nomismata n ON n.nomisma_id = s.nomisma_id
If you are getting more rows because there are more matches in other tables, then you need to look into adding more WHERE conditions or using another type of JOIN.
More info: https://www.codeproject.com/Articles/33052/Visual-Representation-of-SQL-Joins
Edit: To see where you are getting multiple matches on your JOIN:
SELECT s.Store_id "Store_id from Store"
,sy.Store_id "Store_id from Synola"
,s.fpa_id "fpa_id from Store"
,f.fpa_id "fpa_id from Fpa"
,s.nomisma_id "nomisma_id from Store"
,n.nomisma_id "nomisma_id from Nomismata"
FROM Stores s
JOIN Synola sy ON sy.Store_id = s.Store_id
JOIN Fpa f ON f.fpa_id = s.fpa_id
JOIN Nomismata n ON n.nomisma_id = s.nomisma_id
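As a rough illustration of why 2 stores can produce more than 2 rows, the sketch below builds tiny Stores and Synola tables in SQLite from Python. The data and the amount column are hypothetical; the assumption is that Synola holds several rows per store, which the question does not confirm.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Stores (Store_id INTEGER, fpa_id INTEGER, nomisma_id INTEGER);
-- 'amount' is a hypothetical column used only for this illustration.
CREATE TABLE Synola (Store_id INTEGER, amount REAL);
INSERT INTO Stores VALUES (1, 10, 100), (2, 10, 100);
INSERT INTO Synola VALUES (1, 5.0), (1, 7.5), (1, 9.0), (2, 4.0);
""")

# The inner join returns one row per matching Synola row, not one per store.
joined = conn.execute("""
    SELECT s.Store_id, sy.amount
    FROM Stores s
    JOIN Synola sy ON sy.Store_id = s.Store_id
""").fetchall()

# Counting matches per key shows which stores are being multiplied.
per_store = conn.execute("""
    SELECT Store_id, COUNT(*)
    FROM Synola
    GROUP BY Store_id
    HAVING COUNT(*) > 1
""").fetchall()

# Aggregating the many-side gives exactly one row per store.
one_row_each = conn.execute("""
    SELECT s.Store_id, SUM(sy.amount)
    FROM Stores s
    JOIN Synola sy ON sy.Store_id = s.Store_id
    GROUP BY s.Store_id
    ORDER BY s.Store_id
""").fetchall()

print(len(joined))    # prints 4
print(per_store)      # prints [(1, 3)]
print(one_row_each)   # prints [(1, 21.5), (2, 4.0)]
```

Whether aggregating, filtering with extra WHERE conditions, or picking a single Synola row per store is the right choice depends on what the loop is supposed to display.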
SELECT sc.TAAC_SHARE_CLASS_ID,
SCS.SHARE_CLASS_SID,
SCS.REPORTING_DT,
SCS.SHARE_CLASS_SNAPSHOT_SID,
SCS.DIST_UNMOD_30_DAY_YIELD_PCT,
SCS.DER_DIST_12_MO_YIELD_PCT,
SCS.DER_SEC_30_DAY_YIELD_PCT AS SCS_DER_SEC_30_DAY_YIELD_PCT,
SCS.DER_SEC_RESTATED_YIELD_PCT AS SCS_DER_SEC_RESTATED_YIELD_PCT
FROM SHARE_CLASS sc
INNER JOIN PORTFOLIO P ON (P.PORTFOLIO_SID=SC.PORTFOLIO_SID)
INNER JOIN SHARE_CLASS_SNAPSHOT SCS ON
(SCS.SHARE_CLASS_SID=sc.SHARE_CLASS_SID)
WHERE SCS.REPORTING_DT = '24-JUL-17' AND P.PORTFOLIO_ID = 638;
I ran this query and got the following output : image
Here, instead of getting separate rows for the same TAAC_SHARE_CLASS_ID, I want to merge the outputs of same TAAC_SHARE_CLASS_ID.
For example, the first row with TAAC_SHARE_CLASS_ID = 000648 should have values for all the 4 columns :
SCS.DIST_UNMOD_30_DAY_YIELD_PCT,
SCS.DER_DIST_12_MO_YIELD_PCT,
SCS.DER_SEC_30_DAY_YIELD_PCT,
SCS.DER_SEC_RESTATED_YIELD_PCT.
Hence the first row should have values for those columns as 2.96,3.2972596, 7541.085263433, 7550.
The last 4 rows of my output are not really required, as we have now merged those data into first 4 rows correspondingly.
How can I alter this query to achieve the same? Please help.
I suggest you group your results by TAAC_SHARE_CLASS_ID column, and MAX() the remaining columns, something like this:
SELECT sc.TAAC_SHARE_CLASS_ID,
max(SCS.SHARE_CLASS_SID) as SHARE_CLASS_SID,
max(SCS.REPORTING_DT) as REPORTING_DT,
max(SCS.SHARE_CLASS_SNAPSHOT_SID) as SHARE_CLASS_SNAPSHOT_SID,
max(SCS.DIST_UNMOD_30_DAY_YIELD_PCT) as DIST_UNMOD_30_DAY_YIELD_PCT,
max(SCS.DER_DIST_12_MO_YIELD_PCT) as DER_DIST_12_MO_YIELD_PCT,
max(SCS.DER_SEC_30_DAY_YIELD_PCT) AS SCS_DER_SEC_30_DAY_YIELD_PCT,
max(SCS.DER_SEC_RESTATED_YIELD_PCT) AS SCS_DER_SEC_RESTATED_YIELD_PCT
FROM SHARE_CLASS sc
INNER JOIN PORTFOLIO P ON (P.PORTFOLIO_SID=SC.PORTFOLIO_SID)
INNER JOIN SHARE_CLASS_SNAPSHOT SCS ON (SCS.SHARE_CLASS_SID=sc.SHARE_CLASS_SID)
WHERE SCS.REPORTING_DT = '24-JUL-17' AND P.PORTFOLIO_ID = 638
GROUP BY sc.TAAC_SHARE_CLASS_ID;
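The GROUP BY plus MAX() approach works because aggregate functions skip NULLs, so each column keeps whichever row actually has a value. A minimal sketch with SQLite from Python follows; the four yield values come from the question, but the single snapshot table and the split of values between the two rows are assumptions, since the original output image is not available.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE snapshot (
        TAAC_SHARE_CLASS_ID TEXT,
        DIST_UNMOD_30_DAY_YIELD_PCT REAL,
        DER_DIST_12_MO_YIELD_PCT REAL,
        DER_SEC_30_DAY_YIELD_PCT REAL,
        DER_SEC_RESTATED_YIELD_PCT REAL
    )
""")

# Assumed layout: each row fills only two of the four yield columns.
conn.executemany(
    "INSERT INTO snapshot VALUES (?, ?, ?, ?, ?)",
    [
        ("000648", 2.96, 3.2972596, None, None),
        ("000648", None, None, 7541.085263433, 7550.0),
    ],
)

# MAX() ignores NULLs, so the two sparse rows collapse into one full row.
merged = conn.execute("""
    SELECT TAAC_SHARE_CLASS_ID,
           MAX(DIST_UNMOD_30_DAY_YIELD_PCT),
           MAX(DER_DIST_12_MO_YIELD_PCT),
           MAX(DER_SEC_30_DAY_YIELD_PCT),
           MAX(DER_SEC_RESTATED_YIELD_PCT)
    FROM snapshot
    GROUP BY TAAC_SHARE_CLASS_ID
""").fetchall()

print(merged)
```

Note that if two rows for the same id both carry a non-NULL value in the same column, MAX() silently keeps the larger one, so this only merges cleanly when the rows are complementary.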
My title is probably not very clear, so I made a little schema to explain what I'm trying to achieve. The xxxx_uid labels are foreign keys linking two tables.
Goal: Retrieve a column from the grids table by giving a proj_uid value.
I'm not very good with SQL joins and I don't know how to build a single query that will achieve that.
Currently, I'm running 3 queries to perform the operation:
1) This gives me a res_uid to work with:
select res_uid from results where results.proj_uid = VALUE order by res_uid asc limit 1;
2) This gives me a rec_uid to work with:
select rec_uid from receptor_results
inner join results on results.res_uid = receptor_results.res_uid
where receptor_results.res_uid = res_uid_VALUE order by rec_uid asc limit 1
3) Get the grid column I want from the grids table:
select grid_name from grids
inner join receptors on receptors.grid_uid = grids.grid_uid
where receptors.rec_uid = rec_uid_VALUE;
Is it possible to write a single SQL query that gives me the same result as the 3 queries I'm currently running?
You're not limited to one JOIN in a query:
select grids.grid_name
from grids
inner join receptors
on receptors.grid_uid = grids.grid_uid
inner join receptor_results
on receptor_results.rec_uid = receptors.rec_uid
inner join results
on results.res_uid = receptor_results.res_uid
where results.proj_uid = VALUE;
select g.grid_name
from results r
join receptor_results rr on r.res_uid = rr.res_uid
join receptors rec on rec.rec_uid = rr.rec_uid
join grids g on g.grid_uid = rec.grid_uid
where r.proj_uid = VALUE
A small note about names: typically in SQL a table is named for a single item, not the group, thus "result" rather than "results" and "receptor" rather than "receptors". As you work with SQL this will make sense, and plural names like yours will start to seem strange. Also, it's one less character to type!
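The chained joins can be tried out on toy data. The sketch below uses SQLite from Python with a guessed schema (the original diagram is not available, so the column lists and sample rows are assumptions). It also shows one difference from the three-step version: the original queries each used "limit 1", so the single query may return several rows unless it is ordered and limited as well.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Assumed minimal schema; only the key columns from the question are modelled.
CREATE TABLE results (res_uid INTEGER, proj_uid INTEGER);
CREATE TABLE receptor_results (res_uid INTEGER, rec_uid INTEGER);
CREATE TABLE receptors (rec_uid INTEGER, grid_uid INTEGER);
CREATE TABLE grids (grid_uid INTEGER, grid_name TEXT);
INSERT INTO results VALUES (1, 7), (2, 7), (3, 8);
INSERT INTO receptor_results VALUES (1, 11), (1, 12), (2, 11), (3, 13);
INSERT INTO receptors VALUES (11, 100), (12, 100), (13, 200);
INSERT INTO grids VALUES (100, 'north'), (200, 'south');
""")

base_query = """
    SELECT g.grid_name
    FROM results r
    JOIN receptor_results rr ON rr.res_uid = r.res_uid
    JOIN receptors rec ON rec.rec_uid = rr.rec_uid
    JOIN grids g ON g.grid_uid = rec.grid_uid
    WHERE r.proj_uid = ?
"""

# Every result/receptor pair of project 7 contributes one row.
all_matches = conn.execute(base_query, (7,)).fetchall()

# Ordering and limiting mimics the "first res_uid, first rec_uid" steps.
first_only = conn.execute(
    base_query + " ORDER BY r.res_uid, rr.rec_uid LIMIT 1", (7,)
).fetchall()

print(all_matches)  # prints [('north',), ('north',), ('north',)]
print(first_only)   # prints [('north',)]
```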
I am new to Pig and am trying to understand the basic commands. I have a data set A which I inner joined to data set B. I want to keep only some of the variables in the resultant data set. How do I do that? This is what I have so far
A = LOAD 'science_scores';
B = LOAD 'math_scores';
AB = JOIN A BY Name, B BY Student_Name;
Now both A and B have a lot of other columns that I don't need. In SQL I would do something like this:
SELECT A.science_score, B.math_score
FROM A
INNER JOIN B
ON A.Name = B.Student_Name
Can someone please help me figure how to do this?
Thanks!
You are looking for the FOREACH and GENERATE keywords.
selected = FOREACH AB GENERATE science_score, math_score;
A = LOAD 'science_scores';
B = LOAD 'math_scores';
AB = JOIN A BY Name, B BY Student_Name;
dump AB;
Please refer to the link below.
How can I do this inner join properly in Apache PIG?