Unable to find record from outer join - apache-pig

I have two relations whose schemas and data are as below:
latest_extract
ticket_num,employee_id,assigned_to,team,assigned_date
1234567,1122525,michael,printer,2019-01-03
1234569,1122536,julie,printer,2019-01-03
1234571,1122538,priscila,printer,2019-01-03
1234572,1122539,susan,scanner,2019-01-03
1234573,1122540,walter,network,2019-01-03
previous_extract
ticket_num,employee_id,assigned_to,team,assigned_date
1234567,1122525,michael,printer,2019-01-02
1234568,1122525,michale,printer,2019-01-02
1234569,1122536,julie,printer,2019-01-02
1234570,1122537,john,scanner,2019-01-02
1234574.1122541,hudson,windows,2019-01-02
join_latest_previous = JOIN previous_extract BY (ticket_num,employee_id) FULL OUTER, latest_extract BY (ticket_num,employee_id);
latest_extract::ticket_num,latest_extract::employee_id,latest_extract::assigned_to,latest_extract::team,latest_extract::assigned_date,
previous_extract::ticket_num,previous_extract::employee_id,previous_extract::assigned_to,previous_extract::team,previous_extract::assigned_date;
1234567,1122525,michael,printer,2019-01-03,1234567,1122525,michael,printer,2019-01-02
,,,,,1234568,1122525,michale,printer,2019-01-02
1234569,1122536,julie,printer,2019-01-03,1234569,1122536,julie,printer,2019-01-02
1234571,1122538,priscila,printer,2019-01-03,,,,,
1234572,1122539,susan,scanner,2019-01-03,,,,,
,,,,,1234570,1122537,john,scanner,2019-01-02
,,,,,1234573,1122540,walter,network,2019-01-03
1234574.1122541,hudson,windows,2019-01-02,,,,,,
I need to assign a flag as follows:
if the team has a single employee and the employee's record exists in the latest extract but not in the previous one, then the flag is 1,
else if the team has a single employee and the employee's record exists in the previous extract but not in the latest one, then the flag is 2,
else if the team has multiple employees and the employee's record exists in the latest extract but not in the previous one, then the flag is 3,
else if the team has multiple employees and the employee's record exists in the previous extract but not in the latest one, then the flag is 4,
else the flag is 5.
diff_latest_previous = FOREACH join_latest_previous GENERATE
((((previous_extract::ticket_num IS NULL) AND (latest_extract::ticket_num IS NOT NULL))OR (previous_extract::ticket_num !=latest_extract::ticket_num))?1:
(((previous_extract::ticket_num IS NOT NULL) AND (latest_extract::ticket_num IS NULL))OR (previous_extract::ticket_num !=latest_extract::ticket_num))?2:
3) AS flag, latest_extract::ticket_num AS l_ticket_num,latest_extract::employee_id AS l_employee_id,latest_extract::assigned_to AS l_assigned_to,latest_extract::team AS l_team,latest_extract::assigned_date AS l_assigned_date,previous_extract::ticket_num AS p_ticket_num,previous_extract::employee_id AS p_employee_id,previous_extract::assigned_to AS p_assigned_to,previous_extract::team AS p_team,previous_extract::assigned_date AS p_assigned_date;
flag,ticket_num,employee_id,assigned_to,team,assigned_date
5,1234567,1122525,michael,printer,2019-01-03,1234567,1122525,michael,printer,2019-01-02
1,,,,,1234568,1122525,michale,printer,2019-01-02
5,1234569,1122536,julie,printer,2019-01-03,1234569,1122536,julie,printer,2019-01-02
2,1234571,1122538,priscila,printer,2019-01-03,,,,,
2,1234572,1122539,susan,scanner,2019-01-03,,,,,
1,,,,1234570,1122537,john,scanner,2019-01-02
1,,,,,,1234573,1122540,walter,network,2019-01-03
2,1234574.1122541,hudson,windows,2019-01-02,,,,,,
With this, I am unable to get the flag values 3 and 4.
Please help me find a way to produce them.
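Flags 3 and 4 depend on how many employees a team has, and the joined relation carries no such count, so a conditional over the join columns alone cannot tell the single-employee cases from the multi-employee ones. The sketch below, in Python rather than Pig, only illustrates the logic. It assumes that "team size" means the number of distinct employee_id values per team across both extracts (the question does not say), and it reads the "1234574.1122541" row as two fields. In Pig the same idea would presumably be a GROUP BY team with a COUNT, joined back onto the outer-join result.

```python
from collections import defaultdict

# Rows: (ticket_num, employee_id, assigned_to, team, assigned_date), copied
# from the question. The "1234574.1122541" row is split into two fields on
# the assumption that the period is a typo for a comma.
latest_extract = [
    ("1234567", "1122525", "michael", "printer", "2019-01-03"),
    ("1234569", "1122536", "julie", "printer", "2019-01-03"),
    ("1234571", "1122538", "priscila", "printer", "2019-01-03"),
    ("1234572", "1122539", "susan", "scanner", "2019-01-03"),
    ("1234573", "1122540", "walter", "network", "2019-01-03"),
]
previous_extract = [
    ("1234567", "1122525", "michael", "printer", "2019-01-02"),
    ("1234568", "1122525", "michale", "printer", "2019-01-02"),
    ("1234569", "1122536", "julie", "printer", "2019-01-02"),
    ("1234570", "1122537", "john", "scanner", "2019-01-02"),
    ("1234574", "1122541", "hudson", "windows", "2019-01-02"),
]


def team_sizes(*extracts):
    """Count distinct employee_ids per team across all extracts (assumed definition)."""
    members = defaultdict(set)
    for rows in extracts:
        for _ticket, employee_id, _name, team, _date in rows:
            members[team].add(employee_id)
    return {team: len(ids) for team, ids in members.items()}


def flag_records(latest, previous):
    """Emulate the full outer join on (ticket_num, employee_id) and assign flags 1-5."""
    sizes = team_sizes(latest, previous)
    latest_by_key = {(r[0], r[1]): r for r in latest}
    previous_by_key = {(r[0], r[1]): r for r in previous}
    flags = {}
    for key in sorted(set(latest_by_key) | set(previous_by_key)):
        in_latest = key in latest_by_key
        in_previous = key in previous_by_key
        row = latest_by_key.get(key) or previous_by_key[key]
        single = sizes[row[3]] == 1  # row[3] is the team
        if in_latest and in_previous:
            flag = 5
        elif in_latest:
            flag = 1 if single else 3
        else:
            flag = 2 if single else 4
        flags[key[0]] = flag  # keyed by ticket_num for readability
    return flags


flags = flag_records(latest_extract, previous_extract)
for ticket_num, flag in flags.items():
    print(ticket_num, flag)
```

With this definition of team size, printer and scanner count as multi-employee teams and network and windows as single-employee teams; a different definition (for example, counting only within the latest extract) would change which rows get 1/2 versus 3/4.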

Related

SQL select database library

How to print all readers, where time between last two borrows is more than 2 months?
select
name, surname, max(k1.borrow_date)
from
k_person
join
k_reader using(person_id)
join
k_rent_books k1 using(reader_id)
join
k_rent_books k2 using(reader_id)
where
months_between(add_months((k1.borrow_date),-2),k2.borrow_date) > 2
group by
name, surname, person_id
order by
surname;
But I don't know how to express a comparison between the last two dates.
Thanks for your help.
Due to some restrictions with the USING clause (e.g. ORA-25154), I had to switch the join syntax, but here's one option. Basically the way to find the last and second last borrow dates for a reader is as follows:
Join to one copy of the K_RENT_BOOKS table (K_RB1) and find the row with the latest BORROW_DATE for the current reader (from K_READER).
Next, join to a second copy of K_RENT_BOOKS (K_RB2), again for the current reader, and find the latest BORROW_DATE that is not the one found in the first copy (K_RB1).
Keep the resulting joined record if the last borrow date is more than two months after the second-last borrow date.
--
select k_p.name, k_rb1.borrow_date, k_rb2.borrow_date
from k_person k_p
inner join
k_reader k_r
on k_p.person_id = k_r.person_id
inner join
k_rent_books k_rb1
on k_rb1.reader_id = k_r.reader_id
inner join
k_rent_books k_rb2
on k_rb2.reader_id = k_r.reader_id
where k_rb1.borrow_date = (select max(borrow_date)
from k_rent_books k_rb3
where k_rb3.reader_id = k_r.reader_id
)
and k_rb2.borrow_date = (select max(borrow_date)
from k_rent_books k_rb4
where k_rb4.reader_id = k_r.reader_id
and k_rb4.borrow_date <> k_rb1.borrow_date
)
and months_between(k_rb1.borrow_date, k_rb2.borrow_date) > 2
There are other ways of doing this that may be faster (e.g. using a with clause that generates the last and second last borrow dates for all readers) but hopefully this provides a starting point.
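As a rough illustration of the WITH-clause variant mentioned above, here is a sketch that runs against SQLite from Python rather than Oracle. The table is cut down to the two columns that matter, the sample rows are invented, and since SQLite has no months_between, the comparison date(prev_date, '+2 months') < last_date is used as a stand-in for it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE k_rent_books (reader_id INTEGER, borrow_date TEXT)")
# Hypothetical sample data, ISO dates so that string comparison orders them.
conn.executemany(
    "INSERT INTO k_rent_books VALUES (?, ?)",
    [
        (1, "2019-01-10"), (1, "2019-02-01"), (1, "2019-06-15"),  # gap > 2 months
        (2, "2019-03-01"), (2, "2019-04-15"),                      # gap < 2 months
        (3, "2019-05-05"),                                         # only one borrow
    ],
)

LAST_TWO_BORROWS = """
WITH last_borrow AS (
    -- latest borrow date per reader
    SELECT reader_id, MAX(borrow_date) AS last_date
    FROM k_rent_books
    GROUP BY reader_id
),
second_last AS (
    -- latest borrow date that is strictly earlier than the last one
    SELECT rb.reader_id, MAX(rb.borrow_date) AS prev_date
    FROM k_rent_books rb
    JOIN last_borrow lb ON lb.reader_id = rb.reader_id
    WHERE rb.borrow_date < lb.last_date
    GROUP BY rb.reader_id
)
SELECT lb.reader_id, lb.last_date, sl.prev_date
FROM last_borrow lb
JOIN second_last sl ON sl.reader_id = lb.reader_id
WHERE date(sl.prev_date, '+2 months') < lb.last_date
ORDER BY lb.reader_id
"""

gap_rows = conn.execute(LAST_TWO_BORROWS).fetchall()
print(gap_rows)
```

Readers with a single borrow drop out because they have no row in second_last; whether that is the desired behaviour depends on the requirement.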

How to find bad references in a table in Oracle

I have a data problem I need to clean up. Basically I have two tables storing "package" information, one table for documents and one table for audit information. I have entries in the package tables that reference documents that no longer exist and have been replaced (same name but different id) and I want to write a query to find all the bad ones and which new document should replace them. The only thing linking these two is a string value in the audit table which stores the document name (not id).
I've set up a sample schema here: http://sqlfiddle.com/#!4/997bda/1
package_s is the single values for a package in our application
package_r is the repeating values for a package in our application
(these are joined with the same value in the id column)
audit_info is all the audit information in a package
docs is all the documents that can be attached to a package
This query finds the packages with bad attachments (may be more than one per package)
select distinct ps.pkgname, pr.doc_list
from package_s ps, package_r pr
where ps.id = pr.id
and not exists (
select 1 from docs
where pr.doc_list = id
)
order by 1,2 asc
;
I need to build a query with the following rules:
I need to return at least the package id, the position value and the new document id (I will build an update statement to put this new document id in the row matching the package id / position in the package_r table)
the way to get the document name from the audit information is:
SUBSTR(description,0,INSTR(description,'[')-2)
If the document was Added and then Removed, it should be ignored (string_1)
string_2 must not be 'Supporting'
the new document must match
state = 'Master'
latest = 1
pub = '0'
Right now I have a semi-working script that works on a per package basis, but the problem is affecting 2000+ packages. I find the audit entries that don't match documents correctly attached to the package and then search for those names in the document table. The problem with this is since there is no direct link between the package and document tables, if there are multiple problem attachments on one package, each "new" document is returned once per position value, i.e.
package id bad doc id position new doc id
p1 d1 -1 d1-new
p1 d1 -1 d4-new
p1 d4 -2 d1-new
p1 d4 -2 d4-new
It doesn't matter which new id goes into which position value, but duplicated results like this make it hard to mass-generate update scripts; some manual filtering would be required.
This is a somewhat complex and unique data issue, so any help would be greatly appreciated.
This query works according to the information provided:
with ai as (
select a1.audited_id id, dc.id doc_id, dc.docname,
row_number() over (partition by a1.audited_id order by dc.id) rn
from audit_info a1
join docs dc
on dc.state = 'Master' and dc.latest = 1 and dc.pub = '0'
and dc.docname = substr(a1.description, 1, instr(a1.description, '[')-2)
where string_1 = 'Added' and string_2 <> 'Supporting'
and not exists (
select * from audit_info a2
where a2.audited_id = a1.audited_id and string_1 = 'Removed'
and a2.description = a1.description )
and not exists ( -- here matching docs are eliminated
select 1 from package_r pr
where pr.id = a1.audited_id and pr.doc_list = dc.id ) ),
p as (
select ps.id, ps.pkgname, pr.doc_list, pr.position,
row_number() over (partition by ps.id order by doc_list) rn
from package_s ps
join package_r pr on pr.id = ps.id
where not exists ( select * from docs where pr.doc_list = docs.id )
)
select p.id, p.pkgname, p.doc_list, p.position
, ai.docname, ai.doc_id
from p join ai on ai.id = p.id and p.rn = ai.rn
order by p.id, p.doc_list, ai.doc_id
Output:
ID PKGNAME DOC_LIST POSITION DOCNAME DOC_ID
-- ------- -------- -------- ------- ------
p1 000001 d3 -3 doc3 d3-new
p1 000001 d4 -4 doc4 d4-new
p2 000002 d5 -2 doc5 d5-new
p4 000004 d6 -1 doc6 d6-new
Edit: Answers to issues reported in comments
"it is identifying packages that do not have bad values, and then the doc_list column is blank"
Note that the query for identifying packages (my subquery p) is basically your query; I just added a counter there.
I guess that some process or application, or someone manually, cleared the doc_list column in package_r.
If you don't want such entries, just add the condition "and trim(doc_list) is not null" in subquery p.
"for the ones it gets right on the package part (they have a bad value) it is bringing back the wrong docname/doc_id to replace the bad value with, it is a different doc_id in the list."
I understand this only partially. Can you add such entries to your examples (in the Fiddle, or just edit your question and add the problematic input rows and the expected output for them)?
"It doesn't matter which new id goes into which position value".
I made the assignment this way: if we had two old docs named "ABC" and "DEF", and the corrected docs are named "XXA" and "DE12",
then they will be linked as "ABC"->"DE12" and "DEF"->"XXA" (alphabetical ordering seems more rational than totally random).
To make the assignment random, change order by ... to order by null in both row_number() functions.
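The pairing trick in the query (number the bad rows and the replacement rows separately within each package, then join on equal row numbers) can be sketched outside the database as well. The snippet below is plain Python, not Oracle; the function name pair_by_rank is made up for illustration, and the rows are the p1 example from the question.

```python
from collections import defaultdict


def pair_by_rank(bad_rows, new_rows):
    """Pair the n-th bad document with the n-th replacement within each package.

    bad_rows: iterable of (package_id, bad_doc_id, position)
    new_rows: iterable of (package_id, new_doc_id)
    Mirrors joining two row_number() results on (id, rn).
    """
    bad = defaultdict(list)
    new = defaultdict(list)
    for package_id, bad_doc_id, position in bad_rows:
        bad[package_id].append((bad_doc_id, position))
    for package_id, new_doc_id in new_rows:
        new[package_id].append(new_doc_id)

    pairs = []
    for package_id in sorted(bad):
        # zip stops at the shorter list, like an inner join on rn
        for (bad_doc_id, position), new_doc_id in zip(
            sorted(bad[package_id]), sorted(new[package_id])
        ):
            pairs.append((package_id, bad_doc_id, position, new_doc_id))
    return pairs


pairs = pair_by_rank(
    bad_rows=[("p1", "d1", -1), ("p1", "d4", -2)],
    new_rows=[("p1", "d1-new"), ("p1", "d4-new")],
)
print(pairs)
```

Each bad document now appears once, instead of once per candidate replacement, which is what makes generating one update statement per row straightforward.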

How can I set and change the value in one field from query? Can I get the exact results some other way? - Access 2010

I have two tables. One is the reqtable, in which I have:
ID_request, PK
ID_Student,
ID_Professor,
Date,
ID_TypeofReq: 1 meaning 'change mentor', 3 meaning 'rejected by mentor'
ID_approved-rejected: 1 meaning approved, 2 meaning rejected
The other table is STUDENTStable:
ID_STUDENT as primary key
ID_Professor acting like a foreign key but not directly connected to any table.
The idea is to get the current mentor for each student. Students can change mentors, and I always want to have the current one. I tried to achieve that with a query, but I haven't got the right results yet.
So I got the idea to make a query that will update ID_Professor in STUDENTStable, and I want the query to be that connection.
The type of request (ID_typeofreq) can be 1, meaning 'choosing mentor', and if the professor approves (ID_approved-rejected = 1) then the student has a mentor.
The type of request can also be 3, which means that the mentor was mentoring the student but doesn't want to mentor the student anymore (ID_approved-rejected in that case is always 1, meaning approved). If that is the last entry, then the student no longer has a mentor and should not be in the result of the 'currently mentoring' query. But the student can later choose a new mentor and, if accepted, will appear in the result of the 'currently mentoring' query again.
The two tables are not connected, but it's no problem to join them if that will do the job.
SELECT a.ID_request,
a.ID_Student,
a.ID_Professor,
a.Date,
a.ID_typeofreq
FROM [reqtable] AS a
WHERE (((a.ID_typeofreq)=1 Or (a.ID_typeofreq)=3) AND ((a.[ID_approved-rejected])=1)
AND ((a.Date)=(SELECT MAX(b.date)
FROM [reqtable] AS b
WHERE b.ID_request = a.ID_request)))
ORDER BY a.ID_Student DESC;
I need code that will catch the last entry and, if the type of request is 1 and ID_approved-rejected = 1, put the new ID_Professor in STUDENTStable.
And if the last ID_typeofreq is 3 and ID_approved-rejected = 1, it should set the value of ID_Professor to Null again.
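One possible reading of the requirement is "take each student's most recent approved request of type 1 or 3; type 1 gives the current mentor, type 3 gives none". The sketch below tries that in SQLite from Python, not in Access. The column names req_date and ID_approved_rejected are renamed stand-ins, the rows are invented, and the subquery is correlated on ID_Student (not ID_request) so that MAX picks the latest request per student. Access would need IIf instead of CASE.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE reqtable (
           ID_request INTEGER PRIMARY KEY,
           ID_Student INTEGER,
           ID_Professor INTEGER,
           req_date TEXT,                 -- stand-in for the Date column
           ID_TypeofReq INTEGER,          -- 1 = choose mentor, 3 = mentor steps down
           ID_approved_rejected INTEGER   -- 1 = approved, 2 = rejected
       )"""
)
# Hypothetical sample requests.
conn.executemany(
    "INSERT INTO reqtable VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, 10, 100, "2019-01-01", 1, 1),  # student 10 gets mentor 100
        (2, 10, 200, "2019-02-01", 1, 2),  # later change request rejected
        (3, 11, 100, "2019-01-05", 1, 1),  # student 11 gets mentor 100
        (4, 11, 100, "2019-03-01", 3, 1),  # mentor 100 steps down
        (5, 12, 300, "2019-01-07", 1, 1),
        (6, 12, 300, "2019-02-07", 3, 1),
        (7, 12, 200, "2019-04-01", 1, 1),  # student 12 picks a new mentor
    ],
)

CURRENT_MENTOR = """
SELECT a.ID_Student,
       CASE WHEN a.ID_TypeofReq = 1 THEN a.ID_Professor END AS current_mentor
FROM reqtable AS a
WHERE a.ID_TypeofReq IN (1, 3)
  AND a.ID_approved_rejected = 1
  AND a.req_date = (SELECT MAX(b.req_date)
                    FROM reqtable AS b
                    WHERE b.ID_Student = a.ID_Student
                      AND b.ID_TypeofReq IN (1, 3)
                      AND b.ID_approved_rejected = 1)
ORDER BY a.ID_Student
"""

mentors = conn.execute(CURRENT_MENTOR).fetchall()
print(mentors)
```

The result could then drive an update of ID_Professor in STUDENTStable. Two approved requests for the same student on the same date would yield two rows, so a tie-breaker such as the highest ID_request may be needed.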

Find records that do not have a specific value attached

I am fairly new to SQL and I'm completely stuck in this query.
I have written a query for a health organization that pulls patient info, their appointments, which provider saw the patient, what location they were seen at, what documents were created during that visit, and a signoff status.
When the results return, that part of the query is accurate. (There are many rows for one patient because many different types of documents were generated in that one appointment.)
What I am looking for, however, is a list of patients who have had an appointment where "X" document was NOT created. I don't want to see the list of other documents created, only the appointments where "X" was not created.
Please help!
This is the query that I first explained showing the patient/appointment info, and specific documents that have been generated.
SELECT prs.person_nbr, prs.last_name + ', '+ prs.first_name as Patient, prs.date_of_birth
, pm.description AS Provider
, lm.site_id
, appts.appt_type, appts.appt_date
, pd.document_desc, pd.signoff_status
FROM person prs
, patient_documents pd
, location_mstr lm
, provider_mstr pm
, appointments appts
WHERE prs.person_id = appts.person_id
and appts.enc_id = pd.enc_id
and appts.location_id = lm.location_id
and lm.site_id is not null
and appts.rendering_provider_id = pm.provider_id
and ((pd.document_desc IN ('bms_master_im', 'Master_Im', 'BMS_ob_master','BMS_GYN_master', 'BMS_bh_master')
and pd.signoff_status = 'P')
OR (pd.document_desc IN ('bms_master_im', 'Master_Im', 'BMS_ob_master', 'BMS_GYN_master', 'BMS_bh_master')
AND pd.signoff_status IS NULL ))
ORDER BY lm.site_id, Provider
Additional info:
Tried everything suggested.
To elaborate, here's an example.
Patient: JOHN SMITH
Appointment: 20111201
Documents created: work_letter, lab_req, master_im (master_im is one of the listed to be included in the results by the IN clause)
Patient: JANE WILCOX
Appointment: 20120704
Documents created: lab_results, test_action, immunization_record.
For JOHN SMITH, the results would show that he has a master_im. Based on the already written query.
For JANE WILCOX, the results would exclude her since her documents are not listed in the IN clause. However, I want to see her in the results so that I see that she does not have a master_im or any of the other documents listed in the IN clause created.
What I'm looking for in the end is to exclude JOHN SMITH because his appointment does have the master_im document created, and to include JANE WILCOX because she has an appointment that does not have the master_im. (master_im can be substituted by any of the document values in the IN clause of my query.)
***All patients who have an appointment that a) does not have "x" document created, or b) does have the "x" document created but the signoff status = P or NULL. (B is easy to figure out; I only need help with part a.) And we want to ignore all other documents that are not included in x.
x = ('bms_master_im', 'master_im', 'bms_ob_master', 'bms_gyn_master', 'bms_bh_master', 'ob_master', 'gyn_master', 'bh_master')
You need to expand your WHERE clause by adding a condition on the field that contains the "X" values you speak of. Assuming this field is located in patient_documents.code, your WHERE clause becomes:
WHERE prs.person_id = appts.person_id
and appts.enc_id = pd.enc_id
and appts.location_id = lm.location_id
and lm.site_id is not null
and appts.rendering_provider_id = pm.provider_id
and pd.code not in ('X', '...')
and ((pd.document_desc IN ('bms_master_im', 'Master_Im', 'BMS_ob_master','BMS_GYN_master', 'BMS_bh_master')
and pd.signoff_status = 'P')
OR (pd.document_desc IN ('bms_master_im', 'Master_Im', 'BMS_ob_master', 'BMS_GYN_master', 'BMS_bh_master')
AND pd.signoff_status IS NULL ))
Note the added line, and pd.code not in ('X', '...'). Of course, if the field containing the code is named differently, you should modify the field reference.
Edited:
You need to use JOIN clauses on your tables, and a left join on the documents table.
For Example:
FROM person prs
JOIN appointments appts ON prs.person_id = appts.person_id
JOIN location_mstr lm ON appts.location_id = lm.location_id
JOIN provider_mstr pm ON ...
Then
LEFT JOIN patient_documents pd ON ...
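To make the LEFT JOIN idea concrete for part (a), here is a small sketch in SQLite driven from Python, not the asker's actual database. The tables are reduced to a few columns, the enc_id values are invented, and the patient rows are the two examples from the question. The document filter goes into the ON clause, so appointments without any "x" document survive the join with NULLs and are kept by the WHERE.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE appointments (enc_id INTEGER, patient TEXT, appt_date TEXT)")
conn.execute("CREATE TABLE patient_documents (enc_id INTEGER, document_desc TEXT)")
conn.executemany(
    "INSERT INTO appointments VALUES (?, ?, ?)",
    [(1, "JOHN SMITH", "20111201"), (2, "JANE WILCOX", "20120704")],
)
conn.executemany(
    "INSERT INTO patient_documents VALUES (?, ?)",
    [
        (1, "work_letter"), (1, "lab_req"), (1, "master_im"),
        (2, "lab_results"), (2, "test_action"), (2, "immunization_record"),
    ],
)

MISSING_X = """
SELECT a.patient, a.appt_date
FROM appointments a
LEFT JOIN patient_documents pd
       ON pd.enc_id = a.enc_id
      AND lower(pd.document_desc) IN (
              'bms_master_im', 'master_im', 'bms_ob_master', 'bms_gyn_master',
              'bms_bh_master', 'ob_master', 'gyn_master', 'bh_master')
WHERE pd.enc_id IS NULL      -- no "x" document matched this appointment
ORDER BY a.patient
"""

missing = conn.execute(MISSING_X).fetchall()
print(missing)
```

Putting the IN list in the WHERE clause instead would turn the LEFT JOIN back into an inner join and drop JANE WILCOX again. A NOT EXISTS subquery on patient_documents should give the same rows.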

outer query to list only if its rowcount equates to inner subquery

I need help with a query using SQL Server 2005.
I have two tables:
code
  chargecode
  chargeid
  orgid
entry
  chargeid
  itemNo
  rate
I need to list all the chargeids in the entry table if it contains multiple entries with different chargeids that are listed in the code table under the same chargecode.
data :
code
100,1,100
100,2,100
100,3,100
101,11,100
101,12,100
entry
1,x1,1
1,x2,2
2,x3,2
11,x4,1
11,x5,1
Using the above data, the query should list chargeids 1 and 2 and not 11.
I found a way to count how many rows in entry satisfy the criteria, but I am failing to get the chargeids:
select count (distinct chargeId)
from entry where chargeid in (select chargeid from code where chargecode = (SELECT A.chargecode
from code as A join code as B
ON A.chargecode = B.chargeCode and A.chargetype = B.chargetype and A.orgId = B.orgId AND A.CHARGEID = b.CHARGEid
group by A.chargecode,A.orgid
having count(A.chargecode) > 1)
)
First off: I apologise for my completely inaccurate original answer.
The solution to your problem is a self-join. Self-joins are used when you want to select more than one row from the same table. In our case we want to select two charge IDs that have the same charge code:
SELECT DISTINCT c1.chargeid, c2.chargeid FROM code c1
JOIN code c2 ON c1.chargeid != c2.chargeid AND c1.chargecode = c2.chargecode
JOIN entry e1 ON e1.chargeid = c1.chargeid
JOIN entry e2 ON e2.chargeid = c2.chargeid
WHERE c1.chargeid < c2.chargeid
Explanation of this:
First we pick any two charge IDs from 'code'. The DISTINCT avoids duplicates. We make sure they're two different IDs and that they map to the same chargecode.
Then we join on 'entry' (twice) to make sure they both appear in the entry table.
Without an ordering, this approach gives (for your example) the pairs (1,2) and (2,1). So we also insist on an ordering (c1.chargeid < c2.chargeid); this cuts the result set down to just (1,2), as you described.
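As a quick check, the self-join can be run against the sample data from the question. This sketch uses SQLite from Python rather than SQL Server 2005, and assumes the three values per code row are (chargecode, chargeid, orgid) and per entry row (chargeid, itemNo, rate), following the column lists in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE code (chargecode INTEGER, chargeid INTEGER, orgid INTEGER)")
conn.execute("CREATE TABLE entry (chargeid INTEGER, itemNo TEXT, rate INTEGER)")
conn.executemany(
    "INSERT INTO code VALUES (?, ?, ?)",
    [(100, 1, 100), (100, 2, 100), (100, 3, 100), (101, 11, 100), (101, 12, 100)],
)
conn.executemany(
    "INSERT INTO entry VALUES (?, ?, ?)",
    [(1, "x1", 1), (1, "x2", 2), (2, "x3", 2), (11, "x4", 1), (11, "x5", 1)],
)

SELF_JOIN = """
SELECT DISTINCT c1.chargeid, c2.chargeid
FROM code c1
JOIN code c2 ON c1.chargeid != c2.chargeid AND c1.chargecode = c2.chargecode
JOIN entry e1 ON e1.chargeid = c1.chargeid   -- c1 must appear in entry
JOIN entry e2 ON e2.chargeid = c2.chargeid   -- c2 must appear in entry
WHERE c1.chargeid < c2.chargeid              -- keep each pair once
"""

charge_pairs = conn.execute(SELF_JOIN).fetchall()
print(charge_pairs)
```

Charge ID 11 is left out because its only partner under chargecode 101 is 12, which never appears in entry.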