Does a join operation differ when the distinct count is 1 versus n?

Let's say I have two tables, df and df2, that I want to join.
df : name, classId
df2: classId, time
a) df.classid.distinct().count() = 1
b) df.classid.distinct().count() = n , n < 500
c) df.classid.distinct().count() = n , n > 100000
Will the join operation be different for these 3 scenarios?

The syntax of a join does not depend on the number of rows, or the number of distinct key values, in a table; it is always written the same way. In your case you are looking for:
select df.name, df.classid, df2.time
from df
join df2 on df.classid = df2.classid;
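As a minimal sketch of that point, the snippet below runs the same join against two differently shaped data sets. It uses Python's built-in sqlite3 module, which is an assumption (the question does not name a database), and the sample rows and the helper run_join are made up for illustration.

```python
import sqlite3

# The join from the answer; ORDER BY is added only to make the output stable.
JOIN_SQL = """
    SELECT df.name, df.classid, df2.time
    FROM df
    JOIN df2 ON df.classid = df2.classid
    ORDER BY df.name
"""

def run_join(df_rows, df2_rows):
    """Load two hypothetical tables into an in-memory database and join them."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE df (name TEXT, classid INTEGER)")
    con.execute("CREATE TABLE df2 (classid INTEGER, time TEXT)")
    con.executemany("INSERT INTO df VALUES (?, ?)", df_rows)
    con.executemany("INSERT INTO df2 VALUES (?, ?)", df2_rows)
    rows = con.execute(JOIN_SQL).fetchall()
    con.close()
    return rows

# Scenario a): every row of df has the same classid.
one_class = run_join([("ann", 1), ("bob", 1)], [(1, "09:00")])

# Scenarios b) and c): many distinct classids. Only the data changes, not the query.
many_classes = run_join(
    [(f"s{i:03d}", i) for i in range(300)],
    [(i, f"slot-{i}") for i in range(300)],
)

print(one_class)          # [('ann', 1, '09:00'), ('bob', 1, '09:00')]
print(len(many_classes))  # 300
```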


Subsetting Data Using Proc SQL

I am fairly new to SAS and am working on a sorting exercise to improve my SAS skills, but I seem to keep getting stuck since this dataset has observations with different date ranges.
I was given generated admission and discharge data for patients who visited two different hospitals. The data is sorted on admission date and is contained in one dataset. My goal is to create two datasets from this large dataset. The first dataset should contain the patient IDs for those patients who went to hospital A prior to visiting hospital B. The second dataset should contain the patient IDs for those patients who went to hospital B prior to visiting hospital A. A sample of the main dataset looks like this:
ID Hospital Admission_Date Discharge_Date
1 A 21AUG2018 24AUG2018
1 A 02OCT2019 07OCT2019
1 B 07OCT2019 17OCT2019
2 B 01AUG2020 13AUG2020
2 A 28SEP2020 30SEP2020
3 B 17MAY2019 18MAY2019
3 A 18MAY2019 21MAY2019
3 B 21MAY2019 31MAY2019
The two resulting datasets should only include the patient IDs.
For instance, for the dataset where patients went from Hospital A to Hospital B we should have something like this:
ID
1
For the cases where patients went from Hospital B to Hospital A, we should have something like this:
ID
2
3
Any help on this would be greatly appreciated!
In SQL, it would look like this:
proc sql;
create table a_to_b as
select distinct a.id
from have as a
, have as b
where a.id = b.id
AND a.hospital = 'A'
AND b.hospital = 'B'
AND b.admission_date GE a.discharge_date
;
create table b_to_a as
select distinct a.id
from have as a
, have as b
where a.id = b.id
AND a.hospital = 'A'
AND b.hospital = 'B'
AND a.admission_date GE b.discharge_date
;
quit;
The data step version only requires one pass. It assumes that your data is already sorted in the correct order and compares the previous row to the current row. If there is any ID that goes from A to B or B to A, we set a flag for that ID to 1 and stop comparing any further. When we reach the last value of that ID, we output to the appropriate dataset.
data a_to_b
b_to_a
;
set have;
by id;
retain flag_a_to_b flag_b_to_a;
lag_hospital = lag(hospital);
lag_discharge_date = lag(discharge_date);
if(first.id) then call missing(of lag:, of flag:);
if(flag_a_to_b < 1) then flag_a_to_b = ( hospital = 'B'
AND lag_hospital = 'A'
AND admission_date GE lag_discharge_date
)
;
if(flag_b_to_a < 1) then flag_b_to_a = ( hospital = 'A'
AND lag_hospital = 'B'
AND admission_date GE lag_discharge_date
)
;
if(last.id AND flag_a_to_b) then output a_to_b;
if(last.id AND flag_b_to_a) then output b_to_a;
keep id;
run;
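The same single-pass idea can be sketched outside SAS. The Python version below is an illustration only, not a line-by-line translation of the data step: the function name classify_transfers and the tuple layout are made up here, and it assumes the rows are already sorted by ID and admission date, as in the question.

```python
from datetime import date

# Sample rows from the question: (id, hospital, admission_date, discharge_date).
rows = [
    (1, "A", date(2018, 8, 21), date(2018, 8, 24)),
    (1, "A", date(2019, 10, 2), date(2019, 10, 7)),
    (1, "B", date(2019, 10, 7), date(2019, 10, 17)),
    (2, "B", date(2020, 8, 1), date(2020, 8, 13)),
    (2, "A", date(2020, 9, 28), date(2020, 9, 30)),
    (3, "B", date(2019, 5, 17), date(2019, 5, 18)),
    (3, "A", date(2019, 5, 18), date(2019, 5, 21)),
    (3, "B", date(2019, 5, 21), date(2019, 5, 31)),
]

def classify_transfers(rows):
    """One pass over sorted rows, comparing each stay with the previous one."""
    a_to_b, b_to_a = [], []
    prev = None  # plays the role of the lag() values in the data step
    for row in rows:
        pid, hospital, admitted, _ = row
        # Only compare within the same patient, like the reset at first.id.
        if prev is not None and prev[0] == pid and admitted >= prev[3]:
            if prev[1] == "A" and hospital == "B" and pid not in a_to_b:
                a_to_b.append(pid)
            if prev[1] == "B" and hospital == "A" and pid not in b_to_a:
                b_to_a.append(pid)
        prev = row
    return a_to_b, b_to_a

a_to_b, b_to_a = classify_transfers(rows)
print(a_to_b)  # [1, 3]
print(b_to_a)  # [2, 3]
```

Patient 3 lands in both lists because the sample shows a B to A move followed by an A to B move. The expected output in the question lists patient 3 only under B to A, so keeping just each patient's first transfer would need an extra rule.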
How we arrived at the SQL code
SQL in SAS cannot do lags, so instead we do an inner join of the table with itself on ID, which gives every combination of admissions and discharges between hospitals A and B for each patient.
From this joined table, we know that:
hospital_a must be 'A' and hospital_b must be 'B'
The admission date from one hospital must be >= the discharge date from the other hospital
Knowing that, we arrive at the following where clauses:
A to B:
a.id = b.id
AND a.hospital = 'A'
AND b.hospital = 'B'
AND b.admission_date GE a.discharge_date
B to A:
a.id = b.id
AND a.hospital = 'A'
AND b.hospital = 'B'
AND a.admission_date GE b.discharge_date
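To make the joined table concrete, here is a minimal sketch using Python's sqlite3 module rather than PROC SQL (an assumption made only to get a runnable example). Dates are stored as ISO strings so that string comparison matches date order, and >= stands in for GE.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE have (id INTEGER, hospital TEXT, "
    "admission_date TEXT, discharge_date TEXT)"
)
# Sample rows from the question, with dates rewritten as ISO strings.
con.executemany(
    "INSERT INTO have VALUES (?, ?, ?, ?)",
    [
        (1, "A", "2018-08-21", "2018-08-24"),
        (1, "A", "2019-10-02", "2019-10-07"),
        (1, "B", "2019-10-07", "2019-10-17"),
        (2, "B", "2020-08-01", "2020-08-13"),
        (2, "A", "2020-09-28", "2020-09-30"),
        (3, "B", "2019-05-17", "2019-05-18"),
        (3, "A", "2019-05-18", "2019-05-21"),
        (3, "B", "2019-05-21", "2019-05-31"),
    ],
)

# Every pairing of a hospital-A stay (alias a) with a hospital-B stay (alias b)
# for the same patient, before any date condition is applied.
pairs = con.execute(
    """
    SELECT a.id, a.admission_date, a.discharge_date,
           b.admission_date, b.discharge_date
    FROM have AS a
    JOIN have AS b ON a.id = b.id
    WHERE a.hospital = 'A' AND b.hospital = 'B'
    ORDER BY a.id, a.admission_date, b.admission_date
    """
).fetchall()

def transfer_ids(date_condition):
    """Apply one of the two date conditions on top of the self join."""
    sql = f"""
        SELECT DISTINCT a.id
        FROM have AS a
        JOIN have AS b ON a.id = b.id
        WHERE a.hospital = 'A' AND b.hospital = 'B' AND {date_condition}
        ORDER BY a.id
    """
    return [row[0] for row in con.execute(sql)]

a_to_b = transfer_ids("b.admission_date >= a.discharge_date")
b_to_a = transfer_ids("a.admission_date >= b.discharge_date")

print(len(pairs))  # 5
print(a_to_b)      # [1, 3]
print(b_to_a)      # [2, 3]
```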

A join that "allocates" from available rows

I have a table X I want to update according to the entries in another table Y. The join between them is not unique. However, I want each entry in Y to update a different entry in X.
So if I have table X:
i (unique) k v
---------- ---------- ----------
p 100 b
q 101 a
r 202 x
s 301 a
and table Y:
k (unique) v
---------- ----------
0 a
1 b
2 a
3 c
4 a
I want to end up with table X like:
i k v
---------- ---------- ----------
p 1 b
q 0 a
r 202 x
s 2 a
The important result here is that the two rows in X with v = 'a' have been updated to two distinct values of k from Y. (It doesn't matter which ones.)
Currently, this result is achieved by an extra column and a program roughly like:
UPDATE X SET X.used = FALSE;
for Yk, Yv in Y:
UPDATE X
SET X.k = Yk,
X.used = TRUE
WHERE X.i IN (SELECT X.i FROM X
WHERE X.v = Yv AND NOT X.used
LIMIT 1);
In other words, the distinctness is achieved by "using up" the rows in Y. This doesn't scale well.
(I'm using SQLite3 and Python, but don't let that limit you.)
This can be solved by using rowids to pair up the results of a join. Window functions aren't necessary. (Thanks to xQbert for pointing me in this direction.)
First, we sort the two tables by v to make tables with rowids in a suitable order for the join.
CREATE TEMPORARY TABLE Xv AS SELECT * FROM X ORDER BY v;
CREATE TEMPORARY TABLE Yv AS SELECT * FROM Y ORDER BY v;
Then we can pick out the minimum rowid for each value of v in order to create a "zip join" for that value, pairing up the rows.
SELECT i, Yv.k, Xv.v
FROM Xv JOIN Yv USING (v)
JOIN (SELECT v, min(Xv.rowid) AS r FROM Xv GROUP BY v) AS xmin USING (v)
JOIN (SELECT v, min(Yv.rowid) AS r FROM Yv GROUP BY v) AS ymin
ON ymin.v = Xv.v AND Xv.rowid - xmin.r = Yv.rowid - ymin.r;
The clause Xv.rowid - xmin.r = Yv.rowid - ymin.r is the trick: it does a pairwise match of rows with the same value of v, essentially allocating one to the other. The result:
i k v
---------- ---------- ----------
q 0 a
s 2 a
p 1 b
It's then a simple matter to use the result of this query in an UPDATE.
WITH changes AS (<the SELECT above>)
UPDATE X SET k = (SELECT k FROM changes WHERE i = X.i)
WHERE i IN (SELECT i FROM changes);
The temporary tables could be restricted to the common values of v and possibly indexed on v if the query is large.
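Put together, the steps can be run end to end from Python's sqlite3 module. This is a sketch built from the sample tables in the question; the column types are assumptions. Like the queries above, it relies on SQLite handing out rowids in the order the sorted SELECT produces rows, which is observed behaviour rather than something the code enforces.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript(
    """
    CREATE TABLE X (i TEXT PRIMARY KEY, k INTEGER, v TEXT);
    CREATE TABLE Y (k INTEGER PRIMARY KEY, v TEXT);
    INSERT INTO X VALUES ('p', 100, 'b'), ('q', 101, 'a'),
                         ('r', 202, 'x'), ('s', 301, 'a');
    INSERT INTO Y VALUES (0, 'a'), (1, 'b'), (2, 'a'), (3, 'c'), (4, 'a');

    -- Sorted copies, so rows with equal v get consecutive rowids.
    CREATE TEMPORARY TABLE Xv AS SELECT * FROM X ORDER BY v;
    CREATE TEMPORARY TABLE Yv AS SELECT * FROM Y ORDER BY v;

    -- Zip join: the n-th X row for a value of v pairs with the n-th Y row.
    WITH changes AS (
        SELECT i, Yv.k, Xv.v
        FROM Xv JOIN Yv USING (v)
        JOIN (SELECT v, min(Xv.rowid) AS r FROM Xv GROUP BY v) AS xmin USING (v)
        JOIN (SELECT v, min(Yv.rowid) AS r FROM Yv GROUP BY v) AS ymin
          ON ymin.v = Xv.v AND Xv.rowid - xmin.r = Yv.rowid - ymin.r
    )
    UPDATE X SET k = (SELECT k FROM changes WHERE i = X.i)
    WHERE i IN (SELECT i FROM changes);
    """
)

# Map each X.i to its (k, v) after the update.
result = {i: (k, v) for i, k, v in con.execute("SELECT i, k, v FROM X")}
print(result)
```

Row p should end up with k = 1 and row r should keep k = 202, while q and s receive two different k values taken from the Y rows with v = 'a'. Which two they get depends on how ties are ordered, so that part is left unasserted here.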
I'd welcome refinements (or bugs!)

Execute query for each pair of values from a list

I have a list of value pairs over which I iterate and run a query, the skeleton of which could be thought of like this.
list of pairs - ((x1,y1), (x2,y2), ... (xn,yn)) xi, yi are not all distinct.
q is an Oracle query which returns a single value for any (xi,yi)
global_table is a single row table with
id col deleted
1 Y NULL
A few rows from 'table':
id col deleted pid did
1 NULL Y 25 1
81 N NULL NULL 149
101 Y NULL 22 149
61 Y NULL NULL NULL
Also, there is a UNIQUE constraint on (pid, did, deleted) in table.
The query q goes like this.
select w.finalcol from
(select coalesce(a.col,b.col,c.col,d.col) as finalcol from
(select * from global_table where deleted is null) a
left outer join
(select * from table where deleted is null) b
on b.pid is null and b.did is null
left outer join
(select * from table where deleted is null) c
on c.pid is null and c.did = xi
left outer join
(select * from table where deleted is null) d
on d.pid = yi and d.did = xi
) w
n = 60
n is determined by another query which returns the list of value pairs.
for element in (list of pairs)
q(xi,yi) (xi and yi might be used any number of times in the query)
I am trying to reduce the number of times I run this query. (from n times to 1)
I can try isolating the individual lists from the list of pairs and passing them to the query, but the catch is that not all pairs are present in the table(s) being queried. You still get a value for pairs that don't exist in the table(s), since there is a default case at play here (not important).
The default case is when neither
select * from table where deleted is null
and c.pid is null and c.did = xi
nor
select * from table where deleted is null
and d.pid = yi and d.did = xi
returns any rows.
I want my result to be of the form
x1 y1 q(x1,y1)
x2 y2 q(x2,y2)
.
.
.
xn yn q(xn,yn)
(no pair must be left out, given that a few pairs might actually not be in the table)
How do I achieve this using just a single query?
Thanks
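One possible direction, sketched here under several assumptions, is to turn the list of pairs into a row source of its own and left join the lookups onto it, so that every pair yields exactly one output row even when nothing matches. The sketch runs on SQLite through Python rather than Oracle, renames the table to settings (a made-up name, since table is a reserved word), and invents the helper q_batched. In Oracle the pairs row source would have to be built differently, for example from SELECT ... FROM dual UNION ALL branches.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript(
    """
    CREATE TABLE global_table (id INTEGER, col TEXT, deleted TEXT);
    CREATE TABLE settings (id INTEGER, col TEXT, deleted TEXT,
                           pid INTEGER, did INTEGER);
    INSERT INTO global_table VALUES (1, 'Y', NULL);
    INSERT INTO settings VALUES
        (1, NULL, 'Y', 25, 1),
        (81, 'N', NULL, NULL, 149),
        (101, 'Y', NULL, 22, 149),
        (61, 'Y', NULL, NULL, NULL);
    """
)

def q_batched(con, pairs):
    """Evaluate the lookup for all (x, y) pairs in one statement."""
    placeholders = ", ".join(["(?, ?)"] * len(pairs))
    sql = f"""
        WITH pairs(x, y) AS (VALUES {placeholders})
        SELECT p.x, p.y, coalesce(a.col, b.col, c.col, d.col) AS finalcol
        FROM pairs AS p
        LEFT JOIN (SELECT * FROM global_table WHERE deleted IS NULL) AS a
               ON 1 = 1
        LEFT JOIN (SELECT * FROM settings WHERE deleted IS NULL) AS b
               ON b.pid IS NULL AND b.did IS NULL
        LEFT JOIN (SELECT * FROM settings WHERE deleted IS NULL) AS c
               ON c.pid IS NULL AND c.did = p.x
        LEFT JOIN (SELECT * FROM settings WHERE deleted IS NULL) AS d
               ON d.pid = p.y AND d.did = p.x
    """
    # Flatten [(x1, y1), (x2, y2), ...] into [x1, y1, x2, y2, ...].
    flat = [value for pair in pairs for value in pair]
    return con.execute(sql, flat).fetchall()

rows = q_batched(con, [(149, 22), (149, 7), (5, 5)])
for row in sorted(rows):
    print(row)
```

The coalesce order is copied from q as written in the question. Because the pairs drive the joins, a pair with no matching rows still appears, with whatever the more general levels supply. Duplicate pairs in the input produce duplicate output rows.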

sql query with comparison but without removing without subquerying

My question is: is it possible to select certain rows in a table according to a comparison rule without removing anything from the result? To clarify what I want, imagine the following example.
I have a table like this:
A | B | C
1 0 hey
1 1 there
2 1 this
3 0 is
3 1 a
4 0 test
Now I want to select the rows that have a 0 in column B (and an 'a' in column C), without removing the rows that don't have a 0 in column B but share the same value in column A.
For that I could do a
select C from T where A in (select A from T where B = 0);
But isn't it possible to select all C values where column B contains a 0, plus those that match them on column A, without the subquery?
I'd gladly provide more information if needed, since it is quite a fuzzy question, but SQL can be confusing sometimes.
Tough to tell without your example result set; but maybe something like this:
SELECT A, B, C
FROM myTable
WHERE (B = 0 AND C LIKE '%A%')
OR (B <> 0 AND B = A)
I think you just want an or condition:
select C
from T
where B = 0 or A in (select A from T where B = 0)
Is this the version you want:
select C
from T
where C = 'a' or A in (select A from T where B = 0)
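For comparison, here is a small sketch of what the subquery from the question returns on the sample table, next to a self join that avoids the subquery. The self join is an assumption about what "without subquerying" is asking for, not something the answers above propose, and SQLite via Python is used only to make it runnable.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE T (A INTEGER, B INTEGER, C TEXT)")
con.executemany(
    "INSERT INTO T VALUES (?, ?, ?)",
    [(1, 0, "hey"), (1, 1, "there"), (2, 1, "this"),
     (3, 0, "is"), (3, 1, "a"), (4, 0, "test")],
)

# The query from the question: keep every row whose A has some row with B = 0.
with_subquery = [row[0] for row in con.execute(
    "SELECT C FROM T WHERE A IN (SELECT A FROM T WHERE B = 0) ORDER BY A, B"
)]

# Same result via a self join: r is the row to return, z is a B = 0 row
# sharing its A. If an A had several B = 0 rows, r would be repeated.
with_self_join = [row[0] for row in con.execute(
    """
    SELECT r.C
    FROM T AS r
    JOIN T AS z ON z.A = r.A AND z.B = 0
    ORDER BY r.A, r.B
    """
)]

print(with_subquery)   # ['hey', 'there', 'is', 'a', 'test']
print(with_self_join)  # ['hey', 'there', 'is', 'a', 'test']
```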

Searching Multiple Rows at a time through a single SQL query

I have a table whose data is in this manner.
A B C
---------
0 6 2
0 3 4
1 0 2
1 1 4
I wrote a SQL query -
select A
from Table
where (B = 6 and C = 2) AND (B = 3 and C = 4);
Obviously it returned zero results since this query would search in the same row. Please help me with writing a better one to produce results such that it can check two rows with a single statement.
EDIT:
I am not looking for an 'OR' statement. I need to find an element of A that has two corresponding rows, one with the values 6,2 and the other with the values 3,4 in columns B,C respectively.
PS.
(I don't have the option of writing two queries and then finding the common elements of two set.)
Many thanks in advance
I guess you want something like this
select A
from YourTable
where (B = 6 and C = 2) or
(B = 3 and C = 4)
group by A
having count(distinct B) >= 2
Try here:
https://data.stackexchange.com/stackoverflow/q/123711/
Use OR instead of AND:
select A from Table where (B=6 and C=2) OR (B=3 and C=4);
If you want each result only once, use DISTINCT:
select DISTINCT A from Table where (B=6 and C=2) OR (B=3 and C=4);
If you need to check the equality of A, use this:
select t1.A
from Table t1
JOIN Table t2 ON t1.A = t2.A
where T1.B=6 and t1.C=2 AND t2.B=3 and t2.C=4
As you see - using AND again
Are you trying to get this??
SELECT A
FROM Table
WHERE (B = 6 AND C = 2) OR (B = 3 AND C = 4)
With the sample data above, this returns the A column for the two matching rows (0 twice).
If not: WHAT exactly are you trying to select?
If you want each value of A only once, use DISTINCT:
SELECT DISTINCT A
FROM Table
WHERE (B = 6 AND C = 2) OR (B = 3 AND C = 4)
Maybe:
select A
from Table
where (B = 6 and C = 2)
INTERSECT
select A
from Table
where (B = 3 and C = 4)
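The three approaches that require both rows (GROUP BY with HAVING, the self join, and INTERSECT) can be checked side by side on the sample data. This is a sketch using SQLite through Python; the table is called tbl here, a made-up name, because Table is a reserved word.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE tbl (A INTEGER, B INTEGER, C INTEGER)")
con.executemany(
    "INSERT INTO tbl VALUES (?, ?, ?)",
    [(0, 6, 2), (0, 3, 4), (1, 0, 2), (1, 1, 4)],
)

def column_a(sql):
    """Run a query and return its first column as a list."""
    return [row[0] for row in con.execute(sql)]

# Match either row, then keep only the A values that matched both.
grouped = column_a("""
    SELECT A FROM tbl
    WHERE (B = 6 AND C = 2) OR (B = 3 AND C = 4)
    GROUP BY A
    HAVING count(DISTINCT B) >= 2
""")

# Pair each (6, 2) row with a (3, 4) row that has the same A.
self_joined = column_a("""
    SELECT t1.A
    FROM tbl AS t1
    JOIN tbl AS t2 ON t1.A = t2.A
    WHERE t1.B = 6 AND t1.C = 2 AND t2.B = 3 AND t2.C = 4
""")

# A values that appear in both single-row result sets.
intersected = column_a("""
    SELECT A FROM tbl WHERE (B = 6 AND C = 2)
    INTERSECT
    SELECT A FROM tbl WHERE (B = 3 AND C = 4)
""")

print(grouped, self_joined, intersected)  # [0] [0] [0]
```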