How to use a second table to assign a dummy variable? - sql

I have two data sets. One is a multi-variable set that includes an id variable (more than one entry can have the same id); the second is a vector of distinct id numbers.
I want to update the first data set, assigning the value 1 to a dummy variable whenever the entry's id appears in the second dataset.
Is merging the two the best way to do this?
Or is there a way such as
UPDATE `directory.dataset_1`
SET dummy = IF(id IN dataset_2.id =1,1, 0)
WHERE TRUE;
If the problem were to be solved in R, a toy example would be:
dataset_1 <-
data.frame(c("000","001","010","011","000"),c("a","b","c","d","e"))
names(dataset_1) <- c("id","other")
dataset_2 <- data.frame(c("000","001"))
names(dataset_2) <- c("id")
result <- data.frame(c("000","001","010","011","000"),c("a","b","c","d","e"),
c(1,1,0,0,1))
names(result) <- c("id","other","dummy")
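For reference, computing the dummy in R itself would be a one-liner over the toy frames above:
dataset_1$dummy <- as.integer(dataset_1$id %in% dataset_2$id)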

The logic needs to go in the WHERE, not in the SET; try the below.
(Note: I'm unsure from your script whether the ID needs to be = 1 in dataset_2, so remove the WHERE from the subquery if not.)
UPDATE dataset_1
SET dummy = 1
WHERE ID IN (SELECT ID
             FROM dataset_2
             WHERE ID = 1)
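If you also want the non-matching rows set to 0 in the same pass (rather than left NULL), a CASE over EXISTS is one option; a sketch, assuming dataset_2 holds plain id values:
-- 1 where the id exists in dataset_2, 0 otherwise
UPDATE dataset_1
SET dummy = CASE
              WHEN EXISTS (SELECT 1
                           FROM dataset_2
                           WHERE dataset_2.id = dataset_1.id) THEN 1
              ELSE 0
            END;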

SQL Server: Selecting Specific Records From a Table with Duplicate Records (Excluding Stale Data from a Query)

I'm trying to put together a query (preferably a SELECT) in SQL Server that works with a single table. Said table is derived from two sets of data: records where SET = OLD represent old data, records where SET = NEW represent new data. My intention is as follows:
If record CODE = A, keep/include the record.
If record CODE = C, keep/include the record but delete/exclude the corresponding record from the old set under the same ACT value.
If record CODE = D, delete/exclude it along with its corresponding record from the old set under the same ACT value.
If CODE = '' (blank/null), keep the record, but only if it exists in the OLD set (meaning there isn't a corresponding record from the new set with the same ACT value).
What the table looks like before logic is applied:
ACT|STATUS |CODE|SET|VALUE
222| | |OLD|1
333| | |OLD|2
444| | |OLD|3
111|ADDED |A |NEW|4
222|CHANGED|C |NEW|5
333|DELETED|D |NEW|6
What the table should look like after logic is applied (end result)
ACT|STATUS |CODE|SET|VALUE
444| | |OLD|3
111|ADDED |A |NEW|4
222|CHANGED|C |NEW|5
While I can probably put together a SELECT query to achieve the end result above, I doubt it will run efficiently, as the table in question has millions of records. What is the best way to do this without taking a long time to obtain the end result?
Something like this: you will have to split your query and UNION the two parts.
--Old Dataset
SELECT O.*
FROM MyTable O
LEFT JOIN MyTable N ON O.ACT = N.ACT AND N.[SET] = 'NEW'
WHERE O.[SET] = 'OLD'
  AND ISNULL(N.CODE, 'A') = 'A'
UNION
-- New records
SELECT N.*
FROM MyTable N
WHERE N.[SET] = 'NEW'
  AND CODE <> 'D'
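Since the two branches select disjoint rows (one keeps only [SET] = 'OLD', the other only [SET] = 'NEW'), UNION ALL should return the same result while skipping the duplicate-elimination sort that UNION forces, which matters at millions of rows; the same query with just that change:
--Old Dataset
SELECT O.*
FROM MyTable O
LEFT JOIN MyTable N ON O.ACT = N.ACT AND N.[SET] = 'NEW'
WHERE O.[SET] = 'OLD'
  AND ISNULL(N.CODE, 'A') = 'A'
UNION ALL -- safe here: the two branches cannot overlap
-- New records
SELECT N.*
FROM MyTable N
WHERE N.[SET] = 'NEW'
  AND CODE <> 'D'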

How to replace a value in one field by a value from another field (different column) within the same view table (SQL)?

I'd like to know if it is possible to replace a value from one field by using a value from another column and a different row.
For example, see the table image (not reproduced here).
I'd like the SumRedeemed value 400 in row 15 to be replaced by the value -1 * (-395); the value -395 comes from EarnPointsLeft in row 6 (both rows have the same CID, meaning they are the same person). Any suggestions?
You need this update statement:
update t
set t.sumredeemed = (-1) * (select earnpointsleft from FifoPtsView where cid = t.cid)
from FifoPtsView t
where t.cid = 5000100008 and t.earnpointsleft = 0
This will work only if the SELECT statement returns exactly one row.
You can simply update your table:
update t
set t.sumredeemed = ABS(t2.earnpointsleft)
from FifoPtsView t
join FifoPtsView t2 on t2.cid = t.cid and isnull(t2.earnpointsleft, 0) <> 0
If you want negative values you can remove the ABS. Please give me your feedback.
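If several rows per CID could carry a nonzero earnpointsleft, both updates become ambiguous about which value wins; aggregating first pins it down. A sketch, assuming the row to update is the one whose earnpointsleft is 0 and that the most negative value is the one to mirror:
update t
set t.sumredeemed = ABS(src.pts)
from FifoPtsView t
join (select cid, MIN(earnpointsleft) as pts -- collapse to one value per cid
      from FifoPtsView
      where isnull(earnpointsleft, 0) <> 0
      group by cid) src on src.cid = t.cid
where t.earnpointsleft = 0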

Take MIN EFF_DT and MAX CANC_DT from data in PIG

Schema:
TYP|ID|RECORD|SEX|EFF_DT|CANC_DT
DMF|1234567|98765432|M|2011-08-30|9999-12-31
DMF|1234567|98765432|M|2011-04-30|9999-12-31
DMF|1234567|98765432|M|2011-04-30|9999-12-31
Suppose I have multiple records like this. I only want to display the records that have the minimum eff_dt and the maximum cancel date.
I want to display just this one record:
DMF|1234567|98765432|M|2011-04-30|9999-12-31
Thank you
Get the min eff_dt and max canc_dt and use them to filter the relation. Assuming you have a relation A:
B = GROUP A ALL;
X = FOREACH B GENERATE MIN(A.EFF_DT);
Y = FOREACH B GENERATE MAX(A.CANC_DT);
C = FILTER A BY ((EFF_DT == X.$0) AND (CANC_DT == Y.$0));
D = DISTINCT C;
DUMP D;
Let's say you have this data (sample here):
DMF|1234567|98765432|M|2011-08-30|9999-12-31
DMF|1234567|98765432|M|2011-04-30|9999-12-31
DMF|1234567|98765432|M|2011-04-30|9999-12-31
DMX|1234567|98765432|M|2011-12-30|9999-12-31
DMX|1234567|98765432|M|2011-04-30|9999-12-31
DMX|1234567|98765432|M|2011-04-01|9999-12-31
Perform these steps:
-- 1. Read data, if you have not
A = load 'data.txt' using PigStorage('|') as (typ: chararray, id:chararray, record:chararray, sex:chararray, eff_dt:datetime, canc_dt:datetime);
-- 2. Group data by the attribute you like to, in this case it is TYP
grouped = group A by typ;
-- 3. Now, generate MIN/MAX for each group. Also, only keep relevant fields
min_max = foreach grouped generate group, MIN(A.eff_dt) as min_eff_dt, MAX(A.canc_dt) as max_canc_dt;
--
dump min_max;
(DMF,2011-04-30T00:00:00.000Z,9999-12-31T00:00:00.000Z)
(DMX,2011-04-01T00:00:00.000Z,9999-12-31T00:00:00.000Z)
If you need to, change datetime to chararray.
Note: there are different ways of doing this; excluding the load step, what I am showing produces the desired result in two steps: GROUP and FOREACH.
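The min_max relation only carries the group key and the two dates; if you need the full original records back (as in the expected output above), one option is to join min_max back to A on exact matches and de-duplicate. A sketch reusing the aliases above:
-- keep only the records whose dates equal their group's min/max, then drop duplicates
joined = JOIN A BY (typ, eff_dt, canc_dt), min_max BY (group, min_eff_dt, max_canc_dt);
full_rows = FOREACH joined GENERATE A::typ, A::id, A::record, A::sex, A::eff_dt, A::canc_dt;
result = DISTINCT full_rows;
DUMP result;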

Removing duplicate pair in Pig

I have the below sample.
Update:
OBR|1|METABOLIC PANEL
OBX|1|Glucose
OBX|2|BUN
OBX|3|CREATININE
OBR|2|RFLX TO VERIFICATION
OBX|1|EGFR
OBX|2|SODIUM
OBR|3|AMBIGUOUS DEFAULT
OBX|1|POTASSIUM
In this sample, consider each OBR as one test; every OBR is followed by OBX rows, which are the results of that OBR. Every OBR carries an id (1, 2 and 3 in this case), and the OBX numbering of a particular OBR restarts at 1. So what I was thinking is: when I find an OBR, I'll create one unique id and put it on all the OBX rows that follow it, until I reach the next OBR, where I'll do the same again.
Below is my expected output.
Expected Result :
OBR|1|METABOLIC PANEL|OBR_filename_1
OBX|1|Glucose|OBR_filename_1
OBX|2|BUN|OBR_filename_1
OBX|3|CREATININE|OBR_filename_1
OBR|2|RFLX TO VERIFICATION|OBR_filename_2
OBX|1|EGFR|OBR_filename_2
OBX|2|SODIUM|OBR_filename_2
OBR|3|AMBIGUOUS DEFAULT|OBR_filename_3
OBX|1|POTASSIUM|OBR_filename_3
Use DISTINCT. Assuming you have a relation A with duplicate records, the below statement will remove the duplicates and store the unique records in relation B:
B = DISTINCT A;
I tried this; it looks like an HL7 file. You can use Stitch, Over & Lead and come up with something like this. There is probably a better solution from a performance standpoint, but this should work, I guess; please let me know how it goes.
DEFINE Over org.apache.pig.piggybank.evaluation.Over('long');
DEFINE Stitch org.apache.pig.piggybank.evaluation.Stitch;
DEFINE lead org.apache.pig.piggybank.evaluation.Lead;
in = LOAD 'hl_file' using PigStorage('|') as (id:chararray, num:int, reason:chararray);
temp = rank in;
ranked = foreach temp generate $0 as row_no, $1 as id:chararray, $2 as orig_id:int, $3 as reason:chararray;
OBR_data = FILTER ranked by id == 'OBR';
next_row_num_OBR = FOREACH (group OBR_data by id) {
    sorted = ORDER OBR_data by row_no;
    stitched = Stitch(sorted, Over(sorted.row_no, 'lead', 0, 1, 1, (long)9999));
    generate flatten(group) as (id:chararray),
             flatten(stitched.(row_no, orig_id, reason, result)) as (row_no:long, orig_id:int, reason:chararray, next_row_no:long);
}
OBX_data = FILTER ranked by id == 'OBX';
Crossed = CROSS next_row_num_OBR, OBX_data;
result = FILTER Crossed BY (OBX_data::row_no > next_row_num_OBR::row_no and OBX_data::row_no < next_row_num_OBR::next_row_no);
This should produce something like this:
(OBR,5,2,RFLX TO VERIFICATION,8,7,OBX,2,SODIUM)
(OBR,1,1,METABOLIC PANEL,5,2,OBX,1,Glucose)
(OBR,5,2,RFLX TO VERIFICATION,8,6,OBX,1,EGFR)
(OBR,8,3,AMBIGUOUS DEFAULT,9999,9,OBX,1,POTASSIUM)
(OBR,1,1,METABOLIC PANEL,5,3,OBX,2,BUN)
(OBR,1,1,METABOLIC PANEL,5,4,OBX,3,CREATININE)
Instead of a file name or a constant, this simply attaches each OBR record to its corresponding OBX rows.
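If you do want the literal OBR_filename_N tag from the expected output, you can project it out of the filtered cross; a sketch, assuming the OBR's own number (orig_id) is the N to embed and the file name part is a constant you supply:
-- build the tag from the parent OBR's number and attach it to each OBX row
tagged = FOREACH result GENERATE
    OBX_data::id AS id,
    OBX_data::orig_id AS num,
    OBX_data::reason AS reason,
    CONCAT('OBR_filename_', (chararray)next_row_num_OBR::orig_id) AS tag;
DUMP tagged;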

Creating filter with SQL queries

I am trying to create a filter with SQL queries but am having trouble with numeric values linking to other tables.
Every time I try to link to another table, it takes the same record and repeats it for every element in the other table.
For example, here is the query:
SELECT ELEMENTS.RID,TAXONOMIES.SHORT_DESCRIPTION,[type],ELEMENT_NAME,ELEMENT_ID,SUBSTITUTION_GROUPS.DESCRIPTION,namespace_prefix,datatype_localname
FROM ELEMENTS,SUBSTITUTION_GROUPS,TAXONOMIES,SCHEMAS,DATA_TYPES
WHERE ELEMENTS.TAXONOMY_ID = TAXONOMIES.RID AND ELEMENTS.ELEMENT_SCHEMA_ID = SCHEMAS.RID AND
ELEMENTS.DATA_TYPE_ID = DATA_TYPES.RID
AND ELEMENTS.SUBSTITUTION_GROUP_ID = 0
The last line is the actual filtering criteria.
Here is an example result (screenshot omitted): there should only be ONE result (Item has an RID of 0), but the query repeats that one record for every row in the SUBSTITUTION_GROUPS table (there are 4).
Here is my database schema for reference (diagram omitted); the lines indicate relationships between tables and the circles indicate the values I want.
You forgot to join ELEMENTS and SUBSTITUTION_GROUPS in your query.
SELECT
ELEMENTS.RID,TAXONOMIES.SHORT_DESCRIPTION,[type],ELEMENT_NAME,ELEMENT_ID,SUBSTITUTION_GROUPS.DESCRIPTION,namespace_prefix,datatype_localname
FROM
ELEMENTS,SUBSTITUTION_GROUPS,TAXONOMIES,SCHEMAS,DATA_TYPES
WHERE
ELEMENTS.TAXONOMY_ID = TAXONOMIES.RID AND ELEMENTS.ELEMENT_SCHEMA_ID = SCHEMAS.RID
AND ELEMENTS.DATA_TYPE_ID = DATA_TYPES.RID
AND ELEMENTS.SUBSTITUTION_GROUP_ID = SUBSTITUTION_GROUPS.RID
AND ELEMENTS.SUBSTITUTION_GROUP_ID = 0
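As a side note, the same query written with explicit JOIN ... ON syntax makes a forgotten join condition much harder to overlook; a sketch of the equivalent:
SELECT ELEMENTS.RID, TAXONOMIES.SHORT_DESCRIPTION, [type], ELEMENT_NAME, ELEMENT_ID,
       SUBSTITUTION_GROUPS.DESCRIPTION, namespace_prefix, datatype_localname
FROM ELEMENTS
JOIN TAXONOMIES ON ELEMENTS.TAXONOMY_ID = TAXONOMIES.RID
JOIN SCHEMAS ON ELEMENTS.ELEMENT_SCHEMA_ID = SCHEMAS.RID
JOIN DATA_TYPES ON ELEMENTS.DATA_TYPE_ID = DATA_TYPES.RID
JOIN SUBSTITUTION_GROUPS ON ELEMENTS.SUBSTITUTION_GROUP_ID = SUBSTITUTION_GROUPS.RID
WHERE ELEMENTS.SUBSTITUTION_GROUP_ID = 0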