How do I find the intersection of a bag and an element without a UDF in Pig?

Say I have two Pig relations,
p which is (id: int, companies: tuple(name:chararray))
and
q which is (id: int, company: chararray).
Now, after I join p and q by id, how do I filter out the rows where q::company is not present in p::companies?
PS: I went through the question "Check if an element is present in a bag?", but it does not seem to match my problem exactly.
Example
sample p
1,(c1 c2 c3)
2,(c4 c5 c6)
3,(c2 c3 c5)
sample q
1,c3
2,c8
3,c5
expected output after the joins
1,c3
3,c5

First, you need to convert p so that every combination of ID and company name appears on its own line.
p_flattened = FOREACH p GENERATE
id,
FLATTEN(TOKENIZE(companies.name, ' ')) AS company;
dump p_flattened;
(1,c1)
(1,c2)
(1,c3)
(2,c4)
(2,c5)
(2,c6)
(3,c2)
(3,c3)
(3,c5)
Then join with q on both fields, so that only the (id, company) pairs that appear in both relations remain, and use a FOREACH to get rid of the duplicate fields.
pq_joined = JOIN p_flattened BY (id, company), q BY (id, company);
final = FOREACH pq_joined GENERATE
q::id AS id,
q::company AS company;
dump final;
(1,c3)
(3,c5)
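The same flatten-then-join idea can be checked outside Pig. The following is only a rough Python sketch of the logic, not Pig code; it assumes the company names are stored as a single space-separated string, as in the sample data, and that names do not repeat within one id.

```python
# Sample data from the question; p's companies are space-separated names.
p = [(1, "c1 c2 c3"), (2, "c4 c5 c6"), (3, "c2 c3 c5")]
q = [(1, "c3"), (2, "c8"), (3, "c5")]

# Step 1: one (id, company) pair per company, like FLATTEN(TOKENIZE(...)).
# A set is enough here because names are assumed not to repeat within an id.
p_flattened = {(pid, name) for pid, names in p for name in names.split(" ")}

# Step 2: keep the rows of q whose (id, company) pair also occurs in
# p_flattened, which is what the inner JOIN on both fields achieves.
final = [row for row in q if row in p_flattened]

print(final)  # [(1, 'c3'), (3, 'c5')]
```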

Related

How to select and alias data only from the column that contains a certain string in Oracle SQL

I am working with address records in an Oracle db. Each row contains information on two parents. There are four columns for phone number types and four columns for numbers. The types are Other_No_Type_1, Other_No_Type_2, Other_No_Type_3, Other_No_Type_4, and any one of them might contain a value of either 'Name1:Mobile', 'Name2:Mobile', 'Father Work', or 'Mother Work'. That value refers to the number in the next column (Other_No_1, Other_No_2, Other_No_3 or Other_No_4).
I need to pull the Other_No_x value when Other_No_Type_x is equal to Name1:Mobile and alias it "contact_1_mobile", and pull the Name2:Mobile number and alias it "contact_2_mobile". In my SELECT below, you can see that I've just written "a.other_no_1 as contact_1_mobile", for example, but that might actually be retrieving a work number or the Name2:Mobile number.
This is my first request for help to the forum, so I apologize for probably not presenting my question properly. Thank you for any help you can give. Here is my statement as it stands now:
SELECT final.*
FROM (
SELECT
--Name 1 in P1 household
a.id
,a.name1_web_user_id as contact_1_id
,a.name1_full_name as contact_1_name
,a.other_no_1 as contact_1_mobile -- THIS IS MY PROBLEM. "Other_No_1" MAY NOT ACTUALLY BE THE NAME1:MOBILE NUMBER TYPE I NEED. THIS PROBLEM IS THE SAME IN EACH SECTION OF MY STATEMENT.
,a.email as contact_1_email
--Name 2 in P1 household
,a.name2_web_user_id as contact_2_id
,a.name2_full_name as contact_2_name
,a.other_no_2 as contact_2_mobile -- PROBLEM: HERE I ACTUALLY NEED TO FIND THE COLUMN THAT CONTAINS THE "Name2:Mobile" NUMBER
,a.EMAIL_2 as contact_2_email
FROM rg_student s left outer join rg_addr a on s.id = a.id
WHERE (
(a.addr_code='P1' AND a.rg_active = 'Y') AND ((a.name1_web_user_id is not null) OR (a.name2_web_user_id is not null))
AND a.id in(SELECT id from rg_student where student_group='Student')
)
UNION
SELECT
--Name 1 in P2 household
a.id
,a.name1_web_user_id as contact_3_id
,a.name1_full_name as contact_3_name
,a.other_no_1 as contact_3_mobile -- PROBLEM LINE
,a.email as contact_3_email
--Name 2 in P2 household
,a.name2_web_user_id as contact_4_id
,a.name2_full_name as contact_4_name
,a.other_no_2 as contact_4_mobile -- PROBLEM LINE
,a.EMAIL_2 as contact_4_email
FROM rg_student s left outer join rg_addr a on s.id = a.id
WHERE (
(a.addr_code='P2' AND a.rg_active = 'Y') AND ((a.name1_web_user_id is not null) OR (a.name2_web_user_id is not null))
AND a.id in(SELECT id from rg_student where student_group='Student')
)
)final
ORDER BY final.id
Just use case or decode:
case 'Name1:Mobile'
when Other_No_Type_1 then Other_No_1
when Other_No_Type_2 then Other_No_2
when Other_No_Type_3 then Other_No_3
when Other_No_Type_4 then Other_No_4
end as contact_1_mobile
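To see the "literal as the CASE selector" trick in action, here is a small sketch that runs the same expression through Python's built-in sqlite3 module. SQLite only stands in for Oracle here, the table is cut down to the phone columns, and the row and phone numbers are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE rg_addr (
           id INTEGER,
           other_no_type_1 TEXT, other_no_1 TEXT,
           other_no_type_2 TEXT, other_no_2 TEXT,
           other_no_type_3 TEXT, other_no_3 TEXT,
           other_no_type_4 TEXT, other_no_4 TEXT)"""
)
# Hypothetical row: Name1's mobile sits in slot 3, Name2's mobile in slot 1.
conn.execute(
    "INSERT INTO rg_addr VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)",
    (1, "Name2:Mobile", "555-0102", "Father Work", "555-0199",
     "Name1:Mobile", "555-0101", None, None),
)

query = """
SELECT id,
       CASE 'Name1:Mobile'
            WHEN other_no_type_1 THEN other_no_1
            WHEN other_no_type_2 THEN other_no_2
            WHEN other_no_type_3 THEN other_no_3
            WHEN other_no_type_4 THEN other_no_4
       END AS contact_1_mobile,
       CASE 'Name2:Mobile'
            WHEN other_no_type_1 THEN other_no_1
            WHEN other_no_type_2 THEN other_no_2
            WHEN other_no_type_3 THEN other_no_3
            WHEN other_no_type_4 THEN other_no_4
       END AS contact_2_mobile
FROM rg_addr
"""
rows = conn.execute(query).fetchall()
print(rows)  # [(1, '555-0101', '555-0102')]
```

Each CASE returns the number from whichever slot carries the wanted type, and NULL when no slot matches.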

How to group by more than one row value?

I am working with PostgreSQL and I can't figure out how to solve a problem. I have a model called Foobar. Some of its attributes are:
FOOBAR
check_in:datetime
qr_code:string
city_id:integer
In this table there is a lot of redundancy (qr_code is not unique), but that is not my problem right now. What I am trying to get are the foobars that have the same qr_code, have been in a known group of cities, and have checked in at different moments.
I got this by querying:
SELECT * FROM foobar AS a
WHERE a.city_id = 1
AND EXISTS (
SELECT * FROM foobar AS b
WHERE a.check_in < b.check_in
AND a.qr_code = b.qr_code
AND b.city_id = 2
AND EXISTS (
SELECT * FROM foobar as c
WHERE b.check_in < c.check_in
AND c.qr_code = b.qr_code
AND c.city_id = 3
AND EXISTS(...)
)
)
where '...' represents more queries to get more persons with the same qr_code, different check_in date and those well known cities.
My problem is that I want to group this by qr_code, and I want to show the check_in fields of each qr_code like this:
2015-11-11 14:14:14 => [2015-11-11 14:14:14, 2015-11-11 16:16:16, 2015-11-11 17:18:20] (this for each different qr_code)
where the value on the left is the earliest date for that qr_code, and the right part lists all the dates for that qr_code, including the first one.
Is this possible with a SQL query only? I am asking because I am actually building this app with Rails, and I know that I could take a different approach with Ruby's array methods (a solution along those lines would be welcome too).
You could solve that with a recursive CTE - if I interpret your question correctly:
This assumes you have a given list of cities that must be visited in order by the same qr_code. Your text doesn't say so, but your query indicates as much.
WITH RECURSIVE
c AS (SELECT '{1,2,3}'::int[] AS cities) -- your list of city_id's here
, route AS (
SELECT f.check_in, f.qr_code, 2 AS idx
FROM foobar f
JOIN c ON f.city_id = c.cities[1]
UNION ALL
SELECT f.check_in, f.qr_code, r.idx + 1
FROM route r
JOIN foobar f USING (qr_code)
JOIN c ON f.city_id = c.cities[r.idx]
WHERE r.check_in < f.check_in
)
SELECT qr_code, array_agg(check_in) AS check_in_list
FROM (
SELECT *
FROM route
ORDER BY qr_code, idx -- or check_in
) sub
GROUP BY 1
HAVING count(*) = (SELECT array_length(cities, 1) FROM c);
Provide the list as array in the first (non-recursive) CTE c.
In the recursive part start with any rows in the first city and travel along your array until the last element.
In the final SELECT aggregate your check_in column in order. Only return qr_code that have visited all cities of the array.
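Since the question also welcomes an approach outside SQL, here is a rough sketch of the same "visit the cities in order" idea in plain Python. It is not a translation of the recursive CTE: it greedily picks the earliest valid check-in at each step and returns one chain per qr_code. The function name `routes` and the sample rows are made up for illustration.

```python
from collections import defaultdict
from datetime import datetime


def routes(rows, cities):
    """rows: iterable of (check_in, qr_code, city_id); cities: ordered city ids.

    Returns {qr_code: [check_in, ...]} for every qr_code that checked in at
    each city in the given order, with strictly increasing timestamps.
    """
    by_code = defaultdict(list)
    for check_in, qr_code, city_id in rows:
        by_code[qr_code].append((check_in, city_id))

    result = {}
    for qr_code, visits in by_code.items():
        visits.sort()  # earliest check-in first
        chain = []
        for city in cities:
            # Earliest check-in at this city that is later than the previous step.
            nxt = next(
                (t for t, c in visits if c == city and (not chain or t > chain[-1])),
                None,
            )
            if nxt is None:
                break
            chain.append(nxt)
        if len(chain) == len(cities):
            result[qr_code] = chain
    return result


rows = [
    (datetime(2015, 11, 11, 14, 14, 14), "abc", 1),
    (datetime(2015, 11, 11, 16, 16, 16), "abc", 2),
    (datetime(2015, 11, 11, 17, 18, 20), "abc", 3),
    (datetime(2015, 11, 11, 15, 0, 0), "xyz", 1),
    (datetime(2015, 11, 11, 15, 30, 0), "xyz", 3),  # never reaches city 2
]

for qr_code, chain in routes(rows, [1, 2, 3]).items():
    print(qr_code, chain[0], "=>", chain)
```

Only "abc" is reported, because "xyz" skips city 2. Taking the earliest possible check-in at each step is enough to decide whether a complete chain exists.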
Similar:
Recursive query used for transitive closure

How to find bad references in a table in Oracle

I have a data problem I need to clean up. Basically I have two tables storing "package" information, one table for documents and one table for audit information. I have entries in the package tables that reference documents that no longer exist and have been replaced (same name but different id) and I want to write a query to find all the bad ones and which new document should replace them. The only thing linking these two is a string value in the audit table which stores the document name (not id).
I've setup a sample schema here: http://sqlfiddle.com/#!4/997bda/1
package_s is the single values for a package in our application
package_r is the repeating values for a package in our application
(these are joined with the same value in the id column)
audit_info is all the audit information in a package
docs is all the documents that can be attached to a package
This query finds the packages with bad attachments (may be more than one per package)
select distinct ps.pkgname, pr.doc_list
from package_s ps, package_r pr
where ps.id = pr.id
and not exists (
select 1 from docs
where pr.doc_list = id
)
order by 1,2 asc
;
I need to build a query with the following rules:
I need to return at least the package id, the position value and the new document id (I will build an update statement to put this new document id in the row matching the package id / position in the package_r table)
the way to get the document name from the audit information is:
SUBSTR(description,0,INSTR(description,'[')-2)
If the document was Added and then Removed, it should be ignored (string_1)
string_2 must not be 'Supporting'
the new document must match
state = 'Master'
latest = 1
pub = '0'
Right now I have a semi-working script that works on a per package basis, but the problem is affecting 2000+ packages. I find the audit entries that don't match documents correctly attached to the package and then search for those names in the document table. The problem with this is since there is no direct link between the package and document tables, if there are multiple problem attachments on one package, each "new" document is returned once per position value, i.e.
package id bad doc id position new doc id
p1 d1 -1 d1-new
p1 d1 -1 d4-new
p1 d4 -2 d1-new
p1 d4 -2 d4-new
It doesn't matter which new id goes into which position value, but the duplication result problem like this makes it hard to mass generate update scripts, some manual filtering would be required.
This is a somewhat complex and unique data issue, so any help would be greatly appreciated.
This query works according to the information provided:
with ai as (
select a1.audited_id id, dc.id doc_id, dc.docname,
row_number() over (partition by a1.audited_id order by dc.id) rn
from audit_info a1
join docs dc
on dc.state = 'Master' and dc.latest = 1 and dc.pub = '0'
and dc.docname = substr(a1.description, 1, instr(a1.description, '[')-2)
where string_1 = 'Added' and string_2 <> 'Supporting'
and not exists (
select * from audit_info a2
where a2.audited_id = a1.audited_id and string_1 = 'Removed'
and a2.description = a1.description )
and not exists ( -- here matching docs are eliminated
select 1 from package_r pr
where pr.id = a1.audited_id and pr.doc_list = dc.id ) ),
p as (
select ps.id, ps.pkgname, pr.doc_list, pr.position,
row_number() over (partition by ps.id order by doc_list) rn
from package_s ps
join package_r pr on pr.id = ps.id
where not exists ( select * from docs where pr.doc_list = docs.id )
)
select p.id, p.pkgname, p.doc_list, p.position
, ai.docname, ai.doc_id
from p join ai on ai.id = p.id and p.rn = ai.rn
order by p.id, p.doc_list, ai.doc_id
Output:
ID PKGNAME DOC_LIST POSITION DOCNAME DOC_ID
-- ------- -------- -------- ------- ------
p1 000001 d3 -3 doc3 d3-new
p1 000001 d4 -4 doc4 d4-new
p2 000002 d5 -2 doc5 d5-new
p4 000004 d6 -1 doc6 d6-new
Edit: Answers to issues reported in comments
it is identifying packages that do not have bad values, and then the doc_list column is blank,
Note that the query for identifying packages (my subquery p) is basically your query; I just added a counter to it.
I guess that some process or application, or someone working manually, cleared the doc_list column in package_r.
If you don't want such entries, just add the condition "and trim(doc_list) is not null" to subquery p.
for the ones it gets right on the package part (they have a bad value) it is bringing back the wrong docname/doc_id to replace the bad value with, it is a different doc_id in the list.
I understand this only partially. Can you add such entries to your examples (in Fiddle or just edit your question and add problematic input rows and expected output for them?)
"It doesn't matter which new id goes into which position value".
I made the assignment this way: if we had two old docs named "ABC" and "DEF", and the corrected docs are named "XXA" and "DE12",
then they will be linked as "ABC"->"DE12" and "DEF"->"XXA" (alphabetical ordering seems more rational than totally random).
To make the assignment random, change order by ... to order by null in both row_number() functions.
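The reason the join on p.rn = ai.rn avoids the duplicated rows from the question can be sketched in a few lines of Python: rank the bad attachments and the candidate replacements separately within each package, then pair them by rank instead of combining every bad doc with every new doc. The helper name `pair_by_rank` and the input tuples are hypothetical; the ids are taken from the duplication example in the question.

```python
from collections import defaultdict


def pair_by_rank(bad, replacements):
    """bad: (package_id, bad_doc_id, position); replacements: (package_id, new_doc_id).

    Pairs the n-th bad attachment of a package with its n-th replacement,
    mirroring the join on p.rn = ai.rn.
    """
    new_by_pkg = defaultdict(list)
    for pkg, new_doc in replacements:
        new_by_pkg[pkg].append(new_doc)

    bad_by_pkg = defaultdict(list)
    for pkg, bad_doc, position in bad:
        bad_by_pkg[pkg].append((bad_doc, position))

    pairs = []
    for pkg in sorted(bad_by_pkg):
        olds = sorted(bad_by_pkg[pkg])          # like "order by doc_list"
        news = sorted(new_by_pkg.get(pkg, []))  # like "order by dc.id"
        # zip stops at the shorter list, like an inner join on the rank.
        for (bad_doc, position), new_doc in zip(olds, news):
            pairs.append((pkg, bad_doc, position, new_doc))
    return pairs


bad = [("p1", "d1", -1), ("p1", "d4", -2)]
replacements = [("p1", "d4-new"), ("p1", "d1-new")]
print(pair_by_rank(bad, replacements))
# [('p1', 'd1', -1, 'd1-new'), ('p1', 'd4', -2, 'd4-new')]
```

Two bad attachments and two replacements give two rows rather than four, so each position gets exactly one new document id.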

Pig split and join

I have a requirement to propagate field values from one row to another given type of record
for example my raw input is
1,firefox,p
1,,q
1,,r
1,,s
2,ie,p
2,,s
3,chrome,p
3,,r
3,,s
4,netscape,p
the desired result
1,firefox,p
1,firefox,q
1,firefox,r
1,firefox,s
2,ie,p
2,ie,s
3,chrome,p
3,chrome,r
3,chrome,s
4,netscape,p
I tried
A = LOAD 'file1.txt' using PigStorage(',') AS (id:int,browser:chararray,type:chararray);
SPLIT A INTO B IF (type =='p'), C IF (type!='p' );
joined = JOIN B BY id FULL, C BY id;
joinedFields = FOREACH joined GENERATE B::id, B::type, B::browser, C::id, C::type;
dump joinedFields;
the result I got was
(,,,1,p )
(,,,1,q)
(,,,1,r)
(,,,1,s)
(2,p,ie,2,s)
(3,p,chrome,3,r)
(3,p,chrome,3,s)
(4,p,netscape,,)
Any help is appreciated, Thanks.
Pig is not exactly SQL; it is built with data flows, MapReduce and groups in mind (joins are also there). You can get the result using a GROUP BY, a FILTER nested in the FOREACH, and FLATTEN.
inpt = LOAD 'file1.txt' using PigStorage(',') AS (id:int,browser:chararray,type:chararray);
grp = GROUP inpt BY id;
Result = FOREACH grp {
P = FILTER inpt BY type == 'p'; -- keep the records of type p for this id
PL = LIMIT P 1; -- make sure there is just one
GENERATE FLATTEN(inpt.(id,type)), FLATTEN(PL.browser); -- convert bags produced by group by back to rows
};
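For readers who want to trace the group, filter and flatten steps by hand, here is a rough Python sketch of the same data flow on the sample input. It is an illustration of the logic rather than Pig code, and it emits the fields in the order id, browser, type to match the desired result.

```python
from collections import defaultdict

rows = [
    (1, "firefox", "p"), (1, "", "q"), (1, "", "r"), (1, "", "s"),
    (2, "ie", "p"), (2, "", "s"),
    (3, "chrome", "p"), (3, "", "r"), (3, "", "s"),
    (4, "netscape", "p"),
]

# GROUP inpt BY id
groups = defaultdict(list)
for rec in rows:
    groups[rec[0]].append(rec)

result = []
for rid, recs in groups.items():
    # FILTER ... BY type == 'p', then LIMIT 1.
    # None if the id has no 'p' record; as far as I know, Pig's FLATTEN of an
    # empty bag would drop those rows instead.
    browser = next((b for _, b, t in recs if t == "p"), None)
    # FLATTEN: one output row per original record, with the browser filled in.
    result.extend((rid, browser, t) for _, _, t in recs)

for row in result:
    print(row)
```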

Reference columns in a FOREACH after a JOIN?

A = load 'a.txt' as (id, a1);
B = load 'b.txt' as (id, b1);
C = join A by id, B by id;
D = foreach C generate id,a1,b1;
dump D;
4th line fails on:
Invalid field projection. Projected field [id] does not exist in schema
I tried to change to A.id but then the last line fails on: ERROR 0: Scalar has more than one row in the output.
What you are looking for is the "Disambiguate Operator". What you want is A::id, not A.id.
A.id says "there is a relation/bag A and there is a column called id in its schema"
A::id says "there is a record from A and that has a column called id"
So, you would do:
A = load 'a.txt' as (id, a1);
B = load 'b.txt' as (id, b1);
C = join A by id, B by id;
D = foreach C generate A::id,a1,b1;
dump D;
A dirty alternative:
Just because I'm lazy, and disambiguation gets really weird when you start doing multiple joins one after another: use unique identifiers.
A = load 'a.txt' as (ida, a1);
B = load 'b.txt' as (idb, b1);
C = join A by ida, B by idb;
D = foreach C generate ida,a1,b1;
dump D;